We will (briefly) analyze the airquality dataset. Here we will try to predict the ozone concentration level w.r.t. some features.
Here we will use a simple model \[ \mathbb{E}(Y \mid X) = \beta_0 + \beta_1 \times \text{solar radiation} + \beta_2 \times \text{wind speed} + \beta_3 \times \text{temperature} \]
fit <- lm(log(Ozone) ~ Solar.R + Wind + Temp,
data = airquality)
summary(fit)
Call:
lm(formula = log(Ozone) ~ Solar.R + Wind + Temp, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-2.06193 -0.29970 -0.00231 0.30756 1.23578
Coefficients:
Estimate Std. Error t value
(Intercept) -0.2621323 0.5535669 -0.474
Solar.R 0.0025152 0.0005567 4.518
Wind -0.0615625 0.0157130 -3.918
Temp 0.0491711 0.0060875 8.077
Pr(>|t|)
(Intercept) 0.636798
Solar.R 1.62e-05 ***
Wind 0.000158 ***
Temp 1.07e-12 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5086 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6644, Adjusted R-squared: 0.655
F-statistic: 70.62 on 3 and 107 DF, p-value: < 2.2e-16
From the output we can see that the ozone level depends on the the solar radiation (since p-value is very very small).
Now do we need wind speed or temperature? For this we need to perform analysis of variance
fit1 <- lm(log(Ozone) ~ Solar.R + Wind + Temp,
data = airquality)
fit2 <- lm(log(Ozone) ~ Solar.R, data = airquality)
anova(fit2, fit1)
Analysis of Variance Table
Model 1: log(Ozone) ~ Solar.R
Model 2: log(Ozone) ~ Solar.R + Wind + Temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 109 65.313
2 107 27.675 2 37.638 72.761 < 2.2e-16
1
2 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above output we should rather use the most complex model, i.e., fit1.
Remark: When doing anova, models need to be NESTED
Here we will perform residual analysis. Remember here we have to check 3 assumptions: - mean of error is 0 - constant variance (homoscedasticity) - normality of error
In R you just invoke
plot(fit1)
Observation number 21 appears to have a large error. Unfortunately Cook’s distance is smaller than 0.5 indicating that, from a statistical standpoint, there is no leverage issue with it. But still I will try to see what happens when we drop it from our dataset
fit1_sub <- update(fit1, subset = -21)
plot(fit1_sub)
An attempt to fix the heteroscedasticity issue, i.e., not working on the log of ozone levels.
attempt <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
plot(attempt)
We can see a quatratic trend on the first plot, suggesting that maybe squaring some feature may be relevant.
fit_square <- lm(Ozone ~ Solar.R + Wind + Temp +
I(Solar.R^2) + I(Wind^2) + I(Temp^2),
data = airquality)
summary(fit_square)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + I(Solar.R^2) + I(Wind^2) +
I(Temp^2), data = airquality)
Residuals:
Min 1Q Median 3Q Max
-48.698 -10.926 -3.786 9.201 79.932
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.984e+02 1.016e+02 2.937 0.00408 **
Solar.R 1.347e-01 8.806e-02 1.529 0.12919
Wind -1.334e+01 2.308e+00 -5.783 7.79e-08 ***
Temp -6.585e+00 2.742e+00 -2.402 0.01808 *
I(Solar.R^2) -2.044e-04 2.549e-04 -0.802 0.42444
I(Wind^2) 4.642e-01 1.010e-01 4.594 1.22e-05 ***
I(Temp^2) 5.223e-02 1.786e-02 2.925 0.00423 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.3 on 104 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.7141, Adjusted R-squared: 0.6976
F-statistic: 43.29 on 6 and 104 DF, p-value: < 2.2e-16
Interestingly, the use of quadratic term seems very relevant here and we finally work with the following model
final_model <- update(fit_square, . ~ . - Solar.R - I(Solar.R^2))
summary(final_model)
Call:
lm(formula = Ozone ~ Wind + Temp + I(Wind^2) + I(Temp^2), data = airquality)
Residuals:
Min 1Q Median 3Q Max
-47.960 -11.266 -2.528 9.815 85.929
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 279.99831 106.00903 2.641 0.00945 **
Wind -12.75492 2.40170 -5.311 5.67e-07 ***
Temp -6.07823 2.84598 -2.136 0.03490 *
I(Wind^2) 0.44873 0.10566 4.247 4.52e-05 ***
I(Temp^2) 0.05069 0.01862 2.723 0.00752 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.32 on 111 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.6688, Adjusted R-squared: 0.6569
F-statistic: 56.04 on 4 and 111 DF, p-value: < 2.2e-16
Let’s do model checking
plot(final_model)
Observation 117 might have a leverage impact. Discard it and stop please the analysis.
final_model_I_swear <- update(final_model, subset = -117)
plot(final_model_I_swear)
To do predictions from my best model, I just use the predict function. Invoking
predict(final_model_I_swear)
1 2 3 4 6
27.121380 28.980794 18.902408 12.223732 9.790593
7 8 9 11 12
21.138252 9.997918 18.305612 36.716512 19.611343
13 14 15 16 17
19.330890 15.503491 10.960297 12.362452 12.082135
18 19 20 21 22
15.198771 14.177385 16.883645 17.636498 17.230083
23 24 28 29 30
17.006472 11.415024 12.599931 31.010591 52.074830
31 38 40 41 44
37.547139 41.333244 57.824440 50.793507 47.611815
47 48 49 50 51
22.533187 26.158965 18.941218 19.328851 26.836635
62 63 64 66 67
74.170978 51.078283 40.550888 68.208438 40.567223
68 69 70 71 73
79.860386 87.157749 90.452908 70.928364 16.144423
74 76 77 78 79
31.010591 28.737703 50.206662 39.552741 60.977982
80 81 82 85 86
76.651977 44.761063 36.716512 56.227476 55.699548
87 88 89 90 91
45.187584 46.781689 67.591831 61.303138 52.831032
92 93 94 95 96
40.550888 50.206662 31.255002 50.263245 63.686217
97 98 99 100 101
58.350978 79.760829 90.244658 63.682517 71.741592
104 105 106 108 109
47.713223 36.673330 36.582044 28.635675 48.779672
110 111 112 113 114
37.547139 29.009535 30.562839 22.721737 14.857881
116 118 120 121 122
34.398631 58.651708 93.303107 121.243184 103.322620
123 124 125 126 127
94.983936 80.368883 93.975265 113.431404 100.933148
128 129 130 131 132
64.383422 38.902501 33.248237 30.562839 23.612415
133 134 135 136 137
23.988765 31.010591 20.922697 44.797218 18.209998
138 139 140 141 142
16.883891 43.656423 10.552798 26.836635 17.056796
143 144 145 146 147
47.611815 10.521342 23.201112 37.113079 17.830840
148 149 151 152 153
10.130261 31.826592 19.101881 34.895709 14.177385
will give me predictions for all my dataset. But if I want to do prediction for a new observation I need to do
new_obs <- data.frame(Wind = 100, Temp = 60)
predict(final_model_I_swear, newdata = new_obs)
1
2305.245
From the outpout below
final_model_I_swear
Call:
lm(formula = Ozone ~ Wind + Temp + I(Wind^2) + I(Temp^2), data = airquality,
subset = -117)
Coefficients:
(Intercept) Wind Temp I(Wind^2)
327.09393 -9.27857 -8.00248 0.31555
I(Temp^2)
0.06406
All things being equal, a one unit increase in wind speed decrease (in mean) the amount of ozone level by around 10.