Principal Component Analysis

Author

Mathieu Ribatet

📝 Part I: Reanalysis of the socio-economic dataset

To retrieve the data, you can go here.

  1. Start by having a glimpse at the data and perform a short descriptive analysis.
  2. To perform PCA, we will use the FactoMineR library (and factoextra for visualization). Installation, which only needs to be done once, is done using
install.packages(c("FactoMineR", "factoextra"))

You are now ready to load them (to be done each time you restart R):

library(FactoMineR)
library(factoextra)
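
For the descriptive analysis of step 1, a minimal base R sketch could look as follows. The socio-economic data are not reproduced here, so the built-in USArrests dataset is used as a stand-in (an assumption for illustration only); replace dat with your own data frame.

```r
## Stand-in data: USArrests ships with R; replace it with the socio-economic data.
dat <- USArrests

str(dat)       # dimensions and variable types
summary(dat)   # min, quartiles, mean and max of each variable

## Very different standard deviations suggest standardizing before the PCA
sds <- sapply(dat, sd)
print(round(sds, 2))

## Pairwise correlations: strongly correlated variables tend to load on the same axis
cor.mat <- cor(dat)
print(round(cor.mat, 2))

## Scatterplot matrix for a quick visual check
pairs(dat)
```

Looking at the standard deviations and the correlation matrix first usually helps to anticipate what the first factorial axes will pick up.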
  1. Read the documentation of the PCA function (you can skip the col.w and row.w arguments since they were not covered during the course). Next, perform a PCA on the socio-economic dataset.
  2. Show how the percentage of explained variance evolves as the number of factorial axes increases. How many axes will you keep? The function fviz_screeplot may be useful.
  3. Plot the first factorial plane, i.e., the plane defined by the 1st and 2nd factorial axes, and interpret it. The function fviz_pca_var will be useful.
  4. Plot the contribution of the variables to the first two axes. Does it match your previous statements? The function fviz_contrib will be useful.
  5. Learn how to retrieve the principal components, i.e., the coordinates of the individuals on the factorial axes, using the get_pca_ind function. These coordinates may be useful for subsequent analyses.
  6. Plot the individuals on the first factorial plane. Now plot only the five individuals that are best represented on the first factorial plane. Do the same with the five individuals that have the largest contributions. The function fviz_pca_ind might be useful.
  7. Create your own new individual as well as new quantitative and qualitative variables. Learn how to use the ind.sup, quali.sup, quanti.sup arguments.
  8. Use the habillage and addEllipses arguments of the fviz_pca_ind function and try to make sense of the result.
  9. Use the select.ind argument to plot only the individuals whose \(\cos^2\) is larger than 0.9. Then plot only the five individuals with the largest contributions to the first factorial axis.
  10. Reinventing the wheel… I wrote an implementation of the PCA (using SVD). Some parts of the code are missing though. Complete the code.
mypca <- function(data){
  n.obs <- nrow(data)
  n.var <- ncol(data)
  data <- scale(data)## centering and scaling the data
  decomp <- svd(data)
  
  U <- decomp$u
  V <- decomp$v
  D <- diag(decomp$d)
  
  ## Compute the percentage of explained variance
  explained.variance.prop <- 1## Fill in
  
  ## Coordinates of individuals and variables onto factorial axis
  ind.coord <- 1## Fill in
  var.coord <- 1## Fill in
  
  ## Some graphics
  par(mfrow = c(1, 3))
  
  ## Evolution of the explained variance
  barplot(100 * explained.variance.prop)
  
  ## Plot individuals onto the 1st factorial plane
  xlab <- paste("1st axis (", 100 * round(explained.variance.prop[1], 3), "%)", sep = "")
  ylab <- paste("2nd axis (", 100 * round(explained.variance.prop[2], 3), "%)", sep = "")
  plot(ind.coord[,1:2], xlab = xlab, ylab = ylab, main = "Individuals")
  abline(h = 0, lty = 2, col = "grey")
  abline(v = 0, lty = 2, col = "grey")
  
  ## Plot the variable onto the 1st factorial plane
  plot(0, xlim = c(-1, 1), ylim = c(-1, 1), xlab = xlab, ylab = ylab, main = "Variables",
       type = "n")
  abline(h = 0, lty = 2, col = "grey")
  abline(v = 0, lty = 2, col = "grey")
  
  ## Draw the unit circle
  angles <- seq(0, 2 * pi, length = 500)
  lines(cos(angles), sin(angles))
  arrows(rep(0, ncol(data)), rep(0, ncol(data)), var.coord[,1], var.coord[,2])
  text(var.coord[,1], var.coord[,2], colnames(data))
  
  return(list(ind.coord = ind.coord, var.coord = var.coord, explained.variance.prop = explained.variance.prop))
}
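
Once the blanks are filled in, one possible way to check your implementation is to compare it with stats::prcomp. The helper below, check.mypca, is hypothetical (it is not part of the lab). It assumes that your ind.coord are on the same scale as the scores returned by prcomp with scale. = TRUE and that your var.coord are the correlations between the variables and the components; if you chose another convention, the corresponding check will report FALSE even though your code may be fine. Axes are only defined up to their sign, hence the absolute values.

```r
## Hypothetical helper (not part of the lab): compares a completed mypca()
## result with stats::prcomp, ignoring the arbitrary sign of each axis.
check.mypca <- function(res, data, tol = 1e-8){
  ref <- prcomp(data, center = TRUE, scale. = TRUE)

  ## Reference values computed from prcomp
  ref.prop <- ref$sdev^2 / sum(ref$sdev^2)
  ref.var <- cor(as.matrix(data), ref$x)

  same <- function(a, b)
    isTRUE(all.equal(abs(unname(as.matrix(a))), abs(unname(as.matrix(b))),
                     tolerance = tol))

  c(explained.variance = same(res$explained.variance.prop, ref.prop),
    ind.coord = same(res$ind.coord, ref$x),
    var.coord = same(res$var.coord, ref.var))
}

## Usage, once mypca() is complete (USArrests is only a stand-in dataset):
## check.mypca(mypca(USArrests), USArrests)
```

Three TRUE values suggest that your explained variance, individual coordinates and variable coordinates agree with prcomp under the assumptions stated above.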

📝 Part II: The French premier league

Data can be retrieved from here. Perform a complete principal component analysis and interpret the results.
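
A complete analysis typically chains the steps of Part I. The sketch below shows one possible workflow; since the league data are not reproduced here, it uses the built-in iris dataset as a stand-in (an assumption for illustration only), with Species playing the role of a supplementary qualitative variable. It requires FactoMineR and factoextra to be installed.

```r
library(FactoMineR)
library(factoextra)

## Stand-in data: iris ships with R (4 quantitative variables + 1 factor).
## Species (column 5) is supplementary: it does not take part in building the axes.
res <- PCA(iris, scale.unit = TRUE, quali.sup = 5, graph = FALSE)

## 1. Explained variance per axis
print(res$eig)
print(fviz_screeplot(res, addlabels = TRUE))

## 2. Variables on the first factorial plane, then their contributions
print(fviz_pca_var(res, axes = c(1, 2)))
print(fviz_contrib(res, choice = "var", axes = 1))
print(fviz_contrib(res, choice = "var", axes = 2))

## 3. Individuals: coordinates, quality of representation, contributions
ind <- get_pca_ind(res)
head(ind$coord)
head(ind$cos2)
head(ind$contrib)

## 4. Individuals coloured by group, with one ellipse per group
print(fviz_pca_ind(res, habillage = iris$Species, addEllipses = TRUE,
                   geom = "point"))

## 5. Individuals with cos2 >= 0.9 on the plotted plane
print(fviz_pca_ind(res, select.ind = list(cos2 = 0.9)))

## 6. Five largest contributors to the first axis only, selected by name
top1 <- names(sort(ind$contrib[, 1], decreasing = TRUE))[1:5]
print(fviz_pca_ind(res, select.ind = list(name = top1)))

## 7. Supplementary elements: individuals 1 to 5 and variable 4 are projected
##    onto the axes but are not used to compute them
res.sup <- PCA(iris, quali.sup = 5, quanti.sup = 4, ind.sup = 1:5,
               graph = FALSE)
res.sup$ind.sup$coord
res.sup$quanti.sup$coord
```

Note that select.ind = list(contrib = 5) would instead keep the five largest contributors over the plotted axes, which is why the first axis is handled by name here.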

Good luck!

Want more? If you have already finished the lab, you can perform a PCA on ranks: rather than using the raw data, compute the ranks of the observations within each variable and apply PCA to these ranks. This limits the contribution of outliers to the PCA.
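
A minimal sketch of the idea in base R is given below. It uses USArrests as a stand-in dataset and adds one artificial outlier (both are assumptions for illustration only), then compares the contribution of that outlier to the first axis with and without the rank transformation. The contrib helper is hypothetical and simply computes the share of an axis' inertia carried by each individual.

```r
## Stand-in data with one artificial outlier (illustration only)
dat <- USArrests
dat["Outlier", ] <- c(200, 3000, 50, 400)

## Replace each variable by its ranks (ties get the average rank)
ranks <- apply(dat, 2, rank)
rownames(ranks) <- rownames(dat)

## PCA on the standardized raw data and on the standardized ranks
pca.raw <- prcomp(dat, scale. = TRUE)
pca.rank <- prcomp(ranks, scale. = TRUE)

## Hypothetical helper: contribution (in %) of each individual to one axis
contrib <- function(pca, axis = 1)
  100 * pca$x[, axis]^2 / sum(pca$x[, axis]^2)

## Contribution of the outlier to the first axis in both analyses
contrib(pca.raw)["Outlier"]
contrib(pca.rank)["Outlier"]
```

Since a rank can never exceed the number of individuals, an extreme value cannot pull an axis towards itself as strongly as it does with the raw data.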