In this lab, we will see how to apply what we have learned about the ridge regression, lasso and elastic net. We will mainly use the glmnet package so please install it first.

1. Just to be sure to understand the math

In this section we will check if the maths related to regularized regression are properly understood. As always, we will fit a ridge regression, lasso and elastic net models on simulated data (so that the true model is perfectly known) from scratch and using the dedicated builtin R function.

a) Generate some artificial data

Let’s first simulate some data

set.seed(1)##fix the random seed so that we all have the same results
p <- 50
n <- 51
X <- matrix(rnorm(n * p), n)
beta.theo <- rep(0, p)
beta.theo[1:6] <- 2
Y.theo <- X %*% beta.theo
Y <- Y.theo + rnorm(n)

b) Analysis “from scratch”

  1. Try to understand every lines of the code below (which computes the penalized least-square)
penalizedleastsquare <- function(par, features, response, lambda, alpha = 0){
  ## par is the vector of parameters, i.e., beta_0, beta_1
  ## features is the data.frame storing features
  ## response is the observed response (binary response here)
  ## lambda is the overall penalty coefficient
  ## alpha is the weight to balance between ridge and lasso penalties (default -> ridge)
  
  return(0.5 * mean((response - features %*% par)^2) + lambda * (alpha * sum(abs(par)) + 0.5 * (1 - alpha) * sum(par^2)))
}
  1. Minimize the above function to get the ridge estimate when \(\lambda = 1\)
## Fill in this part
  1. Redo the previous step with different values for \(\lambda\) and plot the evolution of the parameter estimate as \(\lambda\) increases.
## Fill in this part
  1. Do the same with the lasso.
## Fill in this part
  1. Do the same with an elastic net with \(\alpha = 0.5\).
## Fill in this part

c) Analysis using the dedicated R function

  1. Read the help file of the glmnet function.
  2. Redo the previous fits with this function and compare your results
## Fill in this part
  1. Run the following code and comment
plot(fit)##fit is the output of the previous glm call, I'm here to explain each plot ;-)
plot(fit, xvar = "lambda", label = TRUE)
  1. What does the following pieces of code
coef(fit, s = 1)
cvfit <- cv.glmnet(X, Y)
plot(cvfit)
coef(cvfit, s = "lambda.min")
  1. Read (again) the documentation of the cv.glmnet function and do a 5-fold cross validation for tuning \(\lambda\).

2. Analyzing the Hitter dataset

The Hitter dataset stores 322 observation of major league Baseball players from the 1986 and 1987 season. The aim of this study is to be able to predict the Salary given some features. Here is a quick recap of the available feature and response.

You can get the data from here.

  1. Import the data and familiarize your self with this dataset.
  2. Fit a ridge regression model, plot the “Ridge path” and comment.
  3. Select the best model according to a 5-fold cross validation procedure.
  4. Redo the previous two steps with the lasso.
  5. According to your model which features appear to be significant in predicting the salary?

3. Going a bit further

Actually the glmnet function can do much more than ridge, lasso and elastic net regression. For instance we can do a lasso logistic regression by passing the argument family = binomial to the glmnet function. Reanalyze the Default dataset we previously model with a penalized logistic regression.

LS0tCnRpdGxlOiAiUmVndWxhcml6ZWQgcmVncmVzc2lvbiIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKSW4gdGhpcyBsYWIsIHdlIHdpbGwgc2VlIGhvdyB0byBhcHBseSB3aGF0IHdlIGhhdmUgbGVhcm5lZCBhYm91dCB0aGUgcmlkZ2UgcmVncmVzc2lvbiwgbGFzc28gYW5kIGVsYXN0aWMgbmV0LiBXZSB3aWxsIG1haW5seSB1c2UgdGhlICpnbG1uZXQqIHBhY2thZ2Ugc28gcGxlYXNlIGluc3RhbGwgaXQgZmlyc3QuCgojIyAxLiBKdXN0IHRvIGJlIHN1cmUgdG8gdW5kZXJzdGFuZCB0aGUgbWF0aAoKSW4gdGhpcyBzZWN0aW9uIHdlIHdpbGwgY2hlY2sgaWYgdGhlIG1hdGhzIHJlbGF0ZWQgdG8gcmVndWxhcml6ZWQgcmVncmVzc2lvbiBhcmUgcHJvcGVybHkgdW5kZXJzdG9vZC4gQXMgYWx3YXlzLCB3ZSB3aWxsIGZpdCBhIHJpZGdlIHJlZ3Jlc3Npb24sIGxhc3NvIGFuZCBlbGFzdGljIG5ldCBtb2RlbHMgb24gc2ltdWxhdGVkIGRhdGEgKHNvIHRoYXQgdGhlIHRydWUgbW9kZWwgaXMgcGVyZmVjdGx5IGtub3duKSBmcm9tIHNjcmF0Y2ggYW5kIHVzaW5nIHRoZSBkZWRpY2F0ZWQgYnVpbHRpbiAqUiogZnVuY3Rpb24uCgojIyMgYSkgR2VuZXJhdGUgc29tZSBhcnRpZmljaWFsIGRhdGEKTGV0J3MgZmlyc3Qgc2ltdWxhdGUgc29tZSBkYXRhCmBgYHtyfQpzZXQuc2VlZCgxKSMjZml4IHRoZSByYW5kb20gc2VlZCBzbyB0aGF0IHdlIGFsbCBoYXZlIHRoZSBzYW1lIHJlc3VsdHMKcCA8LSA1MApuIDwtIDUxClggPC0gbWF0cml4KHJub3JtKG4gKiBwKSwgbikKYmV0YS50aGVvIDwtIHJlcCgwLCBwKQpiZXRhLnRoZW9bMTo2XSA8LSAyClkudGhlbyA8LSBYICUqJSBiZXRhLnRoZW8KWSA8LSBZLnRoZW8gKyBybm9ybShuKQpgYGAKCiMjIyBiKSBBbmFseXNpcyAiZnJvbSBzY3JhdGNoIgoKMS4gVHJ5IHRvIHVuZGVyc3RhbmQgZXZlcnkgbGluZXMgb2YgdGhlIGNvZGUgYmVsb3cgKHdoaWNoIGNvbXB1dGVzIHRoZSBwZW5hbGl6ZWQgbGVhc3Qtc3F1YXJlKQpgYGB7cn0KcGVuYWxpemVkbGVhc3RzcXVhcmUgPC0gZnVuY3Rpb24ocGFyLCBmZWF0dXJlcywgcmVzcG9uc2UsIGxhbWJkYSwgYWxwaGEgPSAwKXsKICAjIyBwYXIgaXMgdGhlIHZlY3RvciBvZiBwYXJhbWV0ZXJzLCBpLmUuLCBiZXRhXzAsIGJldGFfMQogICMjIGZlYXR1cmVzIGlzIHRoZSBkYXRhLmZyYW1lIHN0b3JpbmcgZmVhdHVyZXMKICAjIyByZXNwb25zZSBpcyB0aGUgb2JzZXJ2ZWQgcmVzcG9uc2UgKGJpbmFyeSByZXNwb25zZSBoZXJlKQogICMjIGxhbWJkYSBpcyB0aGUgb3ZlcmFsbCBwZW5hbHR5IGNvZWZmaWNpZW50CiAgIyMgYWxwaGEgaXMgdGhlIHdlaWdodCB0byBiYWxhbmNlIGJldHdlZW4gcmlkZ2UgYW5kIGxhc3NvIHBlbmFsdGllcyAoZGVmYXVsdCAtPiByaWRnZSkKICAKICByZXR1cm4oMC41ICogbWVhbigocmVzcG9uc2UgLSBmZWF0dXJlcyAlKiUgcGFyKV4yKSArIGxhbWJkYSAqIChhbHBoYSAqIHN1bShhYnMocGFyKSkgKyAwLjUgKiAoMSAtIGFscGhhKSAqIHN1bShwYXJeMikpKQp9CmBgYAoKMi4gTWluaW1pemUgdGhlIGFib3ZlIGZ1bmN0aW9uIHRvIGdldCB0aGUgcmlkZ2UgZXN0aW1hdGUgd2hlbiAkXGxhbWJkYSA9IDEkCmBgYHtyfQojIyBGaWxsIGluIHRoaXMgcGFydApgYGAKCjMuIFJlZG8gdGhlIHByZXZpb3VzIHN0ZXAgd2l0aCBkaWZmZXJlbnQgdmFsdWVzIGZvciAkXGxhbWJkYSQgYW5kIHBsb3QgdGhlIGV2b2x1dGlvbiBvZiB0aGUgcGFyYW1ldGVyIGVzdGltYXRlIGFzICRcbGFtYmRhJCBpbmNyZWFzZXMuCmBgYHtyfQojIyBGaWxsIGluIHRoaXMgcGFydApgYGAKCjQuIERvIHRoZSBzYW1lIHdpdGggdGhlIGxhc3NvLgpgYGB7cn0KIyMgRmlsbCBpbiB0aGlzIHBhcnQKYGBgCgo1LiBEbyB0aGUgc2FtZSB3aXRoIGFuIGVsYXN0aWMgbmV0IHdpdGggJFxhbHBoYSA9IDAuNSQuCmBgYHtyfQojIyBGaWxsIGluIHRoaXMgcGFydApgYGAKCiMjIyBjKSBBbmFseXNpcyB1c2luZyB0aGUgZGVkaWNhdGVkICpSKiBmdW5jdGlvbgoKMS4gUmVhZCB0aGUgaGVscCBmaWxlIG9mIHRoZSAqZ2xtbmV0KiBmdW5jdGlvbi4KMi4gUmVkbyB0aGUgcHJldmlvdXMgZml0cyB3aXRoIHRoaXMgZnVuY3Rpb24gYW5kIGNvbXBhcmUgeW91ciByZXN1bHRzCmBgYHtyfQojIyBGaWxsIGluIHRoaXMgcGFydApgYGAKCjQuIFJ1biB0aGUgZm9sbG93aW5nIGNvZGUgYW5kIGNvbW1lbnQKYGBge3IgZXZhbD1GQUxTRX0KcGxvdChmaXQpIyNmaXQgaXMgdGhlIG91dHB1dCBvZiB0aGUgcHJldmlvdXMgZ2xtIGNhbGwsIEknbSBoZXJlIHRvIGV4cGxhaW4gZWFjaCBwbG90IDstKQpwbG90KGZpdCwgeHZhciA9ICJsYW1iZGEiLCBsYWJlbCA9IFRSVUUpCmBgYAoKNS4gV2hhdCBkb2VzIHRoZSBmb2xsb3dpbmcgcGllY2VzIG9mIGNvZGUKYGBge3IgZXZhbD1GQUxTRX0KY29lZihmaXQsIHMgPSAxKQpjdmZpdCA8LSBjdi5nbG1uZXQoWCwgWSkKcGxvdChjdmZpdCkKY29lZihjdmZpdCwgcyA9ICJsYW1iZGEubWluIikKYGBgCgo2LiBSZWFkIChhZ2FpbikgdGhlIGRvY3VtZW50YXRpb24gb2YgdGhlICpjdi5nbG1uZXQqIGZ1bmN0aW9uIGFuZCBkbyBhIDUtZm9sZCBjcm9zcyB2YWxpZGF0aW9uIGZvciB0dW5pbmcgJFxsYW1iZGEkLgoKIyMgMi4gQW5hbHl6aW5nIHRoZSAqSGl0dGVyKiBkYXRhc2V0CgpUaGUgKkhpdHRlciogZGF0YXNldCBzdG9yZXMgMzIyIG9ic2VydmF0aW9uIG9mIG1ham9yIGxlYWd1ZSBCYXNlYmFsbCBwbGF5ZXJzIGZyb20gdGhlIDE5ODYgYW5kIDE5ODcgc2Vhc29uLiBUaGUgYWltIG9mIHRoaXMgc3R1ZHkgaXMgdG8gYmUgYWJsZSB0byBwcmVkaWN0IHRoZSAqU2FsYXJ5KiBnaXZlbiBzb21lIGZlYXR1cmVzLiBIZXJlIGlzIGEgcXVpY2sgcmVjYXAgb2YgdGhlIGF2YWlsYWJsZSBmZWF0dXJlIGFuZCByZXNwb25zZS4KCiAgLSBBdEJhdCBOdW1iZXIgb2YgdGltZXMgYXQgYmF0IGluIDE5ODYKICAtIEhpdHMgTnVtYmVyIG9mIGhpdHMgaW4gMTk4NgogIC0gSG1SdW4gTnVtYmVyIG9mIGhvbWUgcnVucyBpbiAxOTg2CiAgLSBSdW5zIE51bWJlciBvZiBydW5zIGluIDE5ODYKICAtIFJCSSBOdW1iZXIgb2YgcnVucyBiYXR0ZWQgaW4gMTk4NgogIC0gV2Fsa3MgTnVtYmVyIG9mIHdhbGtzIGluIDE5ODYKICAtIFllYXJzIE51bWJlciBvZiB5ZWFycyBpbiB0aGUgbWFqb3IgbGVhZ3VlcwogIC0gQ0F0QmF0IE51bWJlciBvZiB0aW1lcyBhdCBiYXQgZHVyaW5nIGhpcyBjYXJlZXIKICAtIENIaXRzIE51bWJlciBvZiBoaXRzIGR1cmluZyBoaXMgY2FyZWVyCiAgLSBDSG1SdW4gTnVtYmVyIG9mIGhvbWUgcnVucyBkdXJpbmcgaGlzIGNhcmVlcgogIC0gQ1J1bnMgTnVtYmVyIG9mIHJ1bnMgZHVyaW5nIGhpcyBjYXJlZXIKICAtIENSQkkgTnVtYmVyIG9mIHJ1bnMgYmF0dGVkIGluIGR1cmluZyBoaXMgY2FyZWVyCiAgLSBDV2Fsa3MgTnVtYmVyIG9mIHdhbGtzIGR1cmluZyBoaXMgY2FyZWVyCiAgLSBMZWFndWUgQSBmYWN0b3Igd2l0aCBsZXZlbHMgQSBhbmQgTiBpbmRpY2F0aW5nIHBsYXllcuKAmXMgbGVhZ3VlIGF0IHRoZSBlbmQgb2YgMTk4NiBEaXZpc2lvbiBBIGZhY3RvciB3aXRoIGxldmVscyBFIGFuZCBXIGluZGljYXRpbmcgcGxheWVy4oCZcyBkaXZpc2lvbiBhdCB0aGUgZW5kIG9mIDE5ODYKICAtIFB1dE91dHMgTnVtYmVyIG9mIHB1dCBvdXRzIGluIDE5ODYKICAtIEFzc2lzdHMgTnVtYmVyIG9mIGFzc2lzdHMgaW4gMTk4NgogIC0gRXJyb3JzIE51bWJlciBvZiBlcnJvcnMgaW4gMTk4NgogIC0gU2FsYXJ5IDE5ODcgYW5udWFsIHNhbGFyeSBvbiBvcGVuaW5nIGRheSBpbiB0aG91c2FuZHMgb2YgZG9sbGFycwogIC0gTmV3TGVhZ3VlIEEgZmFjdG9yIHdpdGggbGV2ZWxzIEEgYW5kIE4gaW5kaWNhdGluZyBwbGF5ZXLigJlzIGxlYWd1ZSBhdCB0aGUgYmVnaW5uaW5nIG9mIDE5ODcKICAKWW91IGNhbiBnZXQgdGhlIGRhdGEgZnJvbSBbaGVyZV0oaHR0cDovL21yaWJhdGV0LnBlcnNvLm1hdGguY25ycy5mci9DZW50cmFsZU5hbnRlcy9GYXN0VHJhY2svSGl0dGVycy5jc3YpLiAKCjEuIEltcG9ydCB0aGUgZGF0YSBhbmQgZmFtaWxpYXJpemUgeW91ciBzZWxmIHdpdGggdGhpcyBkYXRhc2V0LgoyLiBGaXQgYSByaWRnZSByZWdyZXNzaW9uIG1vZGVsLCBwbG90IHRoZSAiUmlkZ2UgcGF0aCIgYW5kIGNvbW1lbnQuCjMuIFNlbGVjdCB0aGUgYmVzdCBtb2RlbCBhY2NvcmRpbmcgdG8gYSA1LWZvbGQgY3Jvc3MgdmFsaWRhdGlvbiBwcm9jZWR1cmUuCjQuIFJlZG8gdGhlIHByZXZpb3VzIHR3byBzdGVwcyB3aXRoIHRoZSBsYXNzby4KNS4gQWNjb3JkaW5nIHRvIHlvdXIgbW9kZWwgd2hpY2ggZmVhdHVyZXMgYXBwZWFyIHRvIGJlIHNpZ25pZmljYW50IGluIHByZWRpY3RpbmcgdGhlIHNhbGFyeT8KCiMjIDMuIEdvaW5nIGEgYml0IGZ1cnRoZXIKCkFjdHVhbGx5IHRoZSAqZ2xtbmV0KiBmdW5jdGlvbiBjYW4gZG8gbXVjaCBtb3JlIHRoYW4gcmlkZ2UsIGxhc3NvIGFuZCBlbGFzdGljIG5ldCByZWdyZXNzaW9uLiBGb3IgaW5zdGFuY2Ugd2UgY2FuIGRvIGEgbGFzc28gbG9naXN0aWMgcmVncmVzc2lvbiAgYnkgcGFzc2luZyB0aGUgYXJndW1lbnQgKmZhbWlseSA9IGJpbm9taWFsKiB0byB0aGUgKmdsbW5ldCogZnVuY3Rpb24uIFJlYW5hbHl6ZSB0aGUgKkRlZmF1bHQqIGRhdGFzZXQgd2UgcHJldmlvdXNseSBtb2RlbCB3aXRoIGEgcGVuYWxpemVkIGxvZ2lzdGljIHJlZ3Jlc3Npb24u