counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
df <- data.frame(treatment, outcome, counts)
fit <- glm(counts ~ outcome + treatment, family = poisson, data = df)Generalized Linear Models
š Newton-Raphson
In this exercise, we will play a bit with the Newton-Raphson and Fisherās scoring algorithms. Althgouh this exercise is a bit theoretical, we will end with our own software implementation.
Let \(f\colon \mathbb{R} \to \mathbb{R}\) be a smooth enough function, e.g., differentiable and co⦠For short, if you need a specific property for \(f\), it is satisfied! (Yes, Iām a bit harsh here but I donāt want to bother you with technical details here since it is not a lecture about optimization). We aim at finding a root \(x_*\) of \(f(x) = 0\).
Let \(x_* = x_0 + h\) ($h$ āsmallā obviously). Give a Taylor expansion (to the 1st order) of \(f\) around \(x=x_*\). By the way, another name for this, is to give a tangent line approximation.
Using the above approximation, give an approximation for \(h\) and hence a ābetterā approximation for \(x_*\). (We will obviously assume that \(f'(x_0) \neq 0\))
Write a pseudo code that iterate this procedure⦠this is the Newton-Raphson algorithm! For those interested in the analysis of the convergence of this algorithm, Wikipedia is a good resource. Although very interesting, it is beyond the scope of this lecture.
Write a Newton-Raphson algorithm to solve the following equations
\[ x^3 - 2 x - 5 = 0, \qquad \cos x = x^3, \]
where you will use as āinitial guessā \(x_0 = 2\) and \(x_0 = 0.5\) respectively. (You can play a bit with these starting values if you feel adventurous š)
Let \(\{f(x; \theta)\colon x \in \mathbb{R}, \theta \in \Theta\}\) be a regular parametric statistical model. Given a sample \(X_1, \ldots, X_n\), write down the score equation. Write a pseudo code for the Newton-Raphson algorithm in this context. What is \(-f'\) in this situation?
Application. Assume that our statistical model is the Poisson model and that we observed
\[ X_1 = 1, X_2 = 5, X_3 = 5, X_4 = 5, X_5 = 4, X_6 = 6, X_7 = 2, X_8 = 4, X_9 = 2, X_{10} = 5. \]
Give a numerical approximation of the maximum likelihood estimator and compare it to its closed form solution.
Try another implementation where you substitute \(f'\) with its expected value. What is the name of this algorithm?
š The glm function
The objective of this exercise is to get you familiarized with the glm function. Of course, you should by now be familiar with the theory of GLMs. If not, reread the lecture notes and/or ask questions!
Read the documentation of the glm function. Pay particular attention to the family argument. You probably have to look at the family function too.
Run the following piece of code and describe what model is fitted here. What is the link function used here?
Comment the output of the following code
summary(fit)Call: glm(formula = counts ~ outcome + treatment, family = poisson, data = df) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 *** outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 * outcome3 -2.930e-01 1.927e-01 -1.520 0.1285 treatment2 1.398e-16 2.000e-01 0.000 1.0000 treatment3 -2.416e-16 2.000e-01 0.000 1.0000 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 10.5814 on 8 degrees of freedom Residual deviance: 5.1291 on 4 degrees of freedom AIC: 56.761 Number of Fisher Scoring iterations: 4Is the
treatmentvariable statistically significant? What aboutoutcome?Read the documentation of the
predict(well actuallypredict.glm) function. Make sure you understand all the different types of prediction, i.e., when you settypeto eitherlink,responseorterms.
š Clotting times
For this exercise we are interested in modelling clotting times of blood for normal plasma diluted to nine percentage concentrations with prothrombin-free plasma; clotting was induced by two lots of thromboplastin.
Read Section 8.4.2 of the McCullagh and Nelderās book that talks about modelling clotting times of blood and make sure you understand it!
Do the initial plot mentioned in the text and comment.
Using R, fit the models described at page 301. Are your results in agreement with what the book says?
The text says āClearly al the terms are necessaryā. How would you assert that the above statement is correct?
Do the plots described in the paragraph starting with āThe plot of the Pearson residualsā¦ā
Do you understand why the authors state that for a GLM with the Gamma family we have \(\mbox{Var}(Y) \propto \mu^2\) while when fitting a usual linear model on \(1/Y\) we have \(\mbox{Var}(Y) \propto \mu^4\)?
š Oesophageal cancer in Ille-et-Vilaine
For this exercise we will analyse the dataset esoph included directly within the R software.
Familiarize yourself with the dataset.
Run the following piece of code and explain what model is fitted here. Why in the R formula the left hand side has two entries
ncasesandncontrols?fit <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = esoph, family = binomial())Give comments here.
Perform a residual analysis.
## Insert code hereGive comments here.
Interpret results.
## Insert code hereGive comments here.
Fit a
quasibinomialGLM and spot any differences between this new model and the previous one.## Insert code hereGive comments here.