In this lab, we will see how to apply what we have learned about ridge regression, the lasso and the elastic net. We will mainly use the glmnet package, so please install it first.
1. Just to be sure we understand the math
In this section, we will check that the math behind regularized regression is properly understood. As always, we will fit ridge regression, lasso and elastic net models on simulated data (so that the true model is perfectly known), both from scratch and using the dedicated built-in R function.
a) Generate some artificial data
Let’s first simulate some data
set.seed(1)##fix the random seed so that we all have the same results
p <- 50
n <- 51
X <- matrix(rnorm(n * p), n)
beta.theo <- rep(0, p)
beta.theo[1:6] <- 2
Y.theo <- X %*% beta.theo
Y <- Y.theo + rnorm(n)
b) Analysis “from scratch”
- Try to understand every line of the code below (which computes the penalized least-squares objective)
penalizedleastsquare <- function(par, features, response, lambda, alpha = 0){
## par is the vector of parameters beta_1, ..., beta_p
## features is the matrix storing the features
## response is the vector of observed (continuous) responses
## lambda is the overall penalty coefficient
## alpha is the weight to balance between ridge and lasso penalties (default -> ridge)
return(0.5 * mean((response - features %*% par)^2) + lambda * (alpha * sum(abs(par)) + 0.5 * (1 - alpha) * sum(par^2)))
}
- Minimize the above function to get the ridge estimate when \(\lambda = 1\)
## Fill in this part
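One possible sketch (not the only solution): since the ridge objective is smooth, a quasi-Newton method such as BFGS works well. This assumes X, Y, p and penalizedleastsquare are defined as in the previous chunks.

```r
## Minimize the penalized least-squares objective with optim();
## extra named arguments are passed through to penalizedleastsquare.
ridge.fit <- optim(par = rep(0, p), fn = penalizedleastsquare,
                   features = X, response = Y, lambda = 1, alpha = 0,
                   method = "BFGS")
ridge.fit$par  ## the ridge estimate of beta for lambda = 1
```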
- Redo the previous step with different values for \(\lambda\) and plot the evolution of the parameter estimates as \(\lambda\) increases.
## Fill in this part
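A possible sketch, reusing the optim() approach over a grid of penalties (the grid bounds here are an arbitrary choice):

```r
## Ridge path: one optim() fit per lambda on a logarithmic grid
lambdas <- 10^seq(-2, 2, length.out = 30)
path <- sapply(lambdas, function(l)
  optim(rep(0, p), penalizedleastsquare, features = X, response = Y,
        lambda = l, alpha = 0, method = "BFGS")$par)
## one curve per coefficient, x-axis on a log scale
matplot(lambdas, t(path), type = "l", lty = 1, log = "x",
        xlab = expression(lambda), ylab = "Parameter estimate")
```

All coefficients should shrink smoothly towards zero as \(\lambda\) grows.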
- Do the same with the lasso.
## Fill in this part
- Do the same with an elastic net with \(\alpha = 0.5\).
## Fill in this part
c) Analysis using the dedicated R function
- Read the help file of the glmnet function.
- Redo the previous fits with this function and compare your results
## Fill in this part
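A possible sketch: glmnet's objective is \(\frac{1}{2n}\|Y - X\beta\|^2 + \lambda\,(\alpha \|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2)\), which matches penalizedleastsquare above, except that glmnet fits an intercept and standardizes the features by default; turning both off makes the comparison with the "from scratch" fits easier.

```r
library(glmnet)
fit.ridge <- glmnet(X, Y, alpha = 0,   intercept = FALSE, standardize = FALSE)
fit.lasso <- glmnet(X, Y, alpha = 1,   intercept = FALSE, standardize = FALSE)
fit.enet  <- glmnet(X, Y, alpha = 0.5, intercept = FALSE, standardize = FALSE)
## e.g., ridge coefficients at lambda = 1, to compare with the optim() solution
coef(fit.ridge, s = 1)
```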
- Run the following code and comment
plot(fit)##fit is the output of the previous glmnet call; explain each plot
plot(fit, xvar = "lambda", label = TRUE)
- What do the following pieces of code do?
coef(fit, s = 1)
cvfit <- cv.glmnet(X, Y)
plot(cvfit)
coef(cvfit, s = "lambda.min")
- Read (again) the documentation of the cv.glmnet function and do a 5-fold cross-validation for tuning \(\lambda\).
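A possible sketch, assuming X and Y are the simulated data from above; the nfolds argument controls the number of folds (the default is 10).

```r
## 5-fold cross-validation over the lambda path (lasso by default)
cvfit5 <- cv.glmnet(X, Y, nfolds = 5)
plot(cvfit5)           ## CV error curve with one-standard-error bands
cvfit5$lambda.min      ## lambda minimizing the CV error
cvfit5$lambda.1se      ## largest lambda within one SE of the minimum
```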
2. Analyzing the Hitters dataset
The Hitters dataset stores 322 observations of Major League Baseball players from the 1986 and 1987 seasons. The aim of this study is to predict the Salary from the other features. Here is a quick recap of the available features and the response.
- AtBat Number of times at bat in 1986
- Hits Number of hits in 1986
- HmRun Number of home runs in 1986
- Runs Number of runs in 1986
- RBI Number of runs batted in 1986
- Walks Number of walks in 1986
- Years Number of years in the major leagues
- CAtBat Number of times at bat during his career
- CHits Number of hits during his career
- CHmRun Number of home runs during his career
- CRuns Number of runs during his career
- CRBI Number of runs batted in during his career
- CWalks Number of walks during his career
- League A factor with levels A and N indicating player’s league at the end of 1986
- Division A factor with levels E and W indicating player’s division at the end of 1986
- PutOuts Number of put outs in 1986
- Assists Number of assists in 1986
- Errors Number of errors in 1986
- Salary 1987 annual salary on opening day in thousands of dollars
- NewLeague A factor with levels A and N indicating player’s league at the beginning of 1987
You can get the data from [here](http://mribatet.perso.math.cnrs.fr/CentraleNantes/FastTrack/Hitters.csv).
- Import the data and familiarize yourself with this dataset.
- Fit a ridge regression model, plot the “Ridge path” and comment.
- Select the best model according to a 5-fold cross-validation procedure.
- Redo the previous two steps with the lasso.
- According to your model, which features appear to be significant in predicting the salary?
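The steps above can be sketched as follows. The file name Hitters.csv and the presence of missing salaries are assumptions about the downloaded data; model.matrix() turns the factors (League, Division, NewLeague) into dummy variables, which glmnet requires.

```r
library(glmnet)
Hitters <- na.omit(read.csv("Hitters.csv"))          ## drop players with NA Salary
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]  ## design matrix, intercept column removed
y <- Hitters$Salary

fit.ridge <- glmnet(x, y, alpha = 0)                 ## ridge
plot(fit.ridge, xvar = "lambda", label = TRUE)       ## the "ridge path"
cv.ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 5)   ## 5-fold CV for lambda

fit.lasso <- glmnet(x, y, alpha = 1)                 ## lasso
plot(fit.lasso, xvar = "lambda", label = TRUE)
cv.lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 5)
coef(cv.lasso, s = "lambda.min")  ## nonzero entries -> retained features
```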
3. Going a bit further
Actually, the glmnet function can do much more than ridge, lasso and elastic net regression. For instance, we can fit a lasso logistic regression by passing the argument family = "binomial" to the glmnet function. Reanalyze the Default dataset we previously modeled with a penalized logistic regression.
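A possible sketch, assuming the Default data come from the ISLR package (adapt the loading step if you use another source):

```r
library(ISLR)    ## provides the Default dataset
library(glmnet)
x <- model.matrix(default ~ ., data = Default)[, -1]  ## dummy-code the student factor
y <- Default$default                                  ## binary factor: "No"/"Yes"

## lasso logistic regression with 5-fold CV for lambda
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 5)
plot(cvfit)
coef(cvfit, s = "lambda.min")
```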