Introduction

In this lab we will use PCA to analyze the characteristics of football players in the French premier league (according to the video game FIFA). Before doing so, we will try to reproduce the results obtained during the lectures on our socio-economic dataset.

(Re) Analysis of the socio-economic dataset

To retrieve the data, you can go here.

  1. Start by having a glimpse at the data and perform a short descriptive analysis.
  2. To perform PCA, we will use the FactoMineR library (and factoextra for visualization purposes). Installation (done only once, of course) is done using
install.packages(c("FactoMineR", "factoextra"))

You are now ready to load them (to be done each time you restart R):

library(FactoMineR)
library(factoextra)
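As an illustration of the short descriptive analysis asked for in the first step, here is a minimal sketch in base R. It uses USArrests, a dataset shipped with R, purely as a stand-in (an assumption; replace it with the socio-economic data once loaded).

```r
## Stand-in data: USArrests ships with base R (assumption, not the lab data)
data <- USArrests

str(data)       # dimensions and type of each variable
summary(data)   # quartiles and mean of each variable

## Standard deviations: very different scales suggest a scaled PCA
sds <- sapply(data, sd)
print(round(sds, 2))

## Correlation matrix: strongly correlated variables hint that a few
## factorial axes may summarize most of the information
cors <- cor(data)
print(round(cors, 2))

## Scatterplot matrix for a quick visual check
pairs(data)
```

If the variables are all numeric, the same calls should work unchanged on the lab dataset.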
  1. Read the documentation of the PCA function (you can skip the col.w and row.w arguments since they were not covered during the course). Next, perform a PCA on the socio-economic dataset.
  2. Show how the percentage of explained variance evolves as the number of factorial axes increases. How many axes will you keep? The function fviz_screeplot may be useful.
  3. Plot the first factorial plane, i.e., the plane defined by the 1st and 2nd factorial axes, and interpret it. The function fviz_pca_var will be useful.
  4. Plot the contribution of the variables to the first two axes. Does it match your previous statements? The function fviz_contrib will be useful.
  5. Learn how to retrieve the principal components, i.e., the coordinates of the individuals on the factorial axes, using the get_pca_ind function. These coordinates might be useful for subsequent analyses.
  6. Plot the individuals on the first factorial plane. Now plot only the five individuals that are best represented on this plane. Do the same with the five that have the largest contributions. The function fviz_pca_ind might be useful.
  7. Create your own new individual as well as new quantitative and qualitative variables. Learn how to use the ind.sup, quali.sup, quanti.sup arguments.
  8. Use the habillage and addEllipses arguments of the fviz_pca_ind function and try to make sense of the result.
  9. Optional, for those who would like to make the connection with the maths: I wrote an implementation of PCA (using SVD), but some parts of the code are missing. Complete the code.
mypca <- function(data){
  n.obs <- nrow(data)
  n.var <- ncol(data)
  data <- scale(data) ## centering and scaling the data
  decomp <- svd(data)
  
  U <- decomp$u
  V <- decomp$v
  D <- diag(decomp$d)
  
  ## Compute the percentage of explained variance
  explained.variance.prop <- 1 ## Fill in
  
  ## Coordinates of individuals and variables onto factorial axis
  ind.coord <- 1 ## Fill in
  var.coord <- 1 ## Fill in
  
  ## Some graphics
  par(mfrow = c(1, 3))
  
  ## Evolution of the explained variance
  barplot(100 * explained.variance.prop)
  
  ## Plot individuals onto the 1st factorial plane
  xlab <- paste("1st axis (", 100 * round(explained.variance.prop[1], 3), "%)", sep = "")
  ylab <- paste("2nd axis (", 100 * round(explained.variance.prop[2], 3), "%)", sep = "")
  plot(ind.coord[,1:2], xlab = xlab, ylab = ylab, main = "Individuals")
  abline(h = 0, lty = 2, col = "grey")
  abline(v = 0, lty = 2, col = "grey")
  
  ## Plot the variable onto the 1st factorial plane
  plot(0, xlim = c(-1, 1), ylim = c(-1, 1), xlab = xlab, ylab = ylab, main = "Variables",
       type = "n")
  abline(h = 0, lty = 2, col = "grey")
  abline(v = 0, lty = 2, col = "grey")
  
  ## Draw the unit circle
  angles <- seq(0, 2 * pi, length = 500)
  lines(cos(angles), sin(angles))
  arrows(rep(0, ncol(data)), rep(0, ncol(data)), var.coord[,1], var.coord[,2])
  text(var.coord[,1], var.coord[,2], colnames(data))
  
  return(list(ind.coord = ind.coord, var.coord = var.coord, explained.variance.prop = explained.variance.prop))
}
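Once the missing parts are filled in, the result can be checked against base R. The sketch below computes reference values with prcomp on USArrests (a stand-in dataset, an assumption) and defines check.mypca, a hypothetical helper that is not part of the lab. It deliberately does not give the SVD formulas. Since the sign of each axis is arbitrary, coordinates are compared in absolute value.

```r
## Stand-in data (assumption): replace with the socio-economic dataset
data <- USArrests

## Reference PCA on centered and scaled data, from base R
ref <- prcomp(data, center = TRUE, scale. = TRUE)

## Proportion of variance explained by each axis
ref.prop <- ref$sdev^2 / sum(ref$sdev^2)

## Coordinates of the individuals on the factorial axes
ref.ind <- ref$x

## Coordinates of the variables: correlations between the (scaled)
## variables and the principal components
ref.var <- cor(scale(data), ref$x)

## Hypothetical helper: compares the list returned by a completed mypca()
## with the reference values above, ignoring axis signs and dimnames
check.mypca <- function(res) {
  same <- function(a, b) {
    isTRUE(all.equal(abs(unname(as.matrix(a))), abs(unname(as.matrix(b)))))
  }
  c(explained.variance.prop = same(res$explained.variance.prop, ref.prop),
    ind.coord = same(res$ind.coord, ref.ind),
    var.coord = same(res$var.coord, ref.var))
}
```

Calling check.mypca(mypca(data)) should then return three TRUE values if the completed function is consistent with prcomp.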

Application: The French premier league

Data can be retrieved from here. Perform a complete statistical analysis.
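A complete analysis can follow the same steps as in the previous section. The sketch below runs them on decathlon2, an example dataset shipped with factoextra, used here only as a stand-in (an assumption). The column indices passed to ind.sup, quanti.sup, quali.sup and habillage are specific to that dataset and would have to be adapted to the football data.

```r
library(FactoMineR)
library(factoextra)

## Stand-in data (assumption): 27 athletes, 10 event results,
## then Rank, Points (quantitative) and Competition (qualitative)
data(decathlon2)

## PCA with supplementary individuals and variables, no automatic plots
res.pca <- PCA(decathlon2, ind.sup = 24:27, quanti.sup = 11:12,
               quali.sup = 13, graph = FALSE)

## Explained variance per axis, as a table and as a scree plot
eig <- get_eigenvalue(res.pca)
print(eig)
fviz_screeplot(res.pca, addlabels = TRUE)

## Variables on the first factorial plane and their contributions
fviz_pca_var(res.pca, repel = TRUE)
fviz_contrib(res.pca, choice = "var", axes = 1:2)

## Coordinates, quality of representation and contributions of individuals
ind <- get_pca_ind(res.pca)
print(head(ind$coord))

## All individuals, then the five best represented (cos2),
## then the five with the largest contributions
fviz_pca_ind(res.pca, repel = TRUE)
fviz_pca_ind(res.pca, select.ind = list(cos2 = 5))
fviz_pca_ind(res.pca, select.ind = list(contrib = 5))

## Color individuals by the qualitative variable and add ellipses
fviz_pca_ind(res.pca, habillage = 13, addEllipses = TRUE)
```

For the football data, a natural choice might be to keep player identifiers or positions as supplementary qualitative variables, but that depends on the columns actually present in the file.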

Good luck!