Generalized Linear Models

Author

Mathieu RIBATET

šŸ“ Newton-Raphson

In this exercise, we will play a bit with the Newton-Raphson and Fisher’s scoring algorithms. Although this exercise is a bit theoretical, we will end up with our own software implementation.

Let \(f\colon \mathbb{R} \to \mathbb{R}\) be a smooth enough function, e.g., differentiable and co… For short, if you need a specific property for \(f\), it is satisfied! (Yes, I’m a bit harsh here, but I don’t want to bother you with technical details since this is not a lecture about optimization.) We aim at finding a root \(x_*\) of \(f\), i.e., a solution of \(f(x) = 0\).

  1. Let \(x_* = x_0 + h\) (\(h\) ā€˜small’, obviously). Give a first-order Taylor expansion of \(f\) around \(x = x_0\), evaluated at \(x = x_*\). By the way, another name for this is the tangent line approximation.

  2. Using the above approximation, give an approximation for \(h\) and hence a ā€˜better’ approximation for \(x_*\). (We will obviously assume that \(f'(x_0) \neq 0\))

  3. Write pseudo-code that iterates this procedure… this is the Newton-Raphson algorithm! For those interested in the convergence analysis of this algorithm, Wikipedia is a good resource. Although very interesting, it is beyond the scope of this lecture.

  4. Write a Newton-Raphson algorithm to solve the following equations

    \[ x^3 - 2 x - 5 = 0, \qquad \cos x = x^3, \]

    where you will use as ā€˜initial guess’ \(x_0 = 2\) and \(x_0 = 0.5\) respectively. (You can play a bit with these starting values if you feel adventurous šŸ˜‰)
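
    In case you want to check your implementation, here is a minimal R sketch; the helper name newton_raphson, the tolerance and the iteration cap are my own choices, not part of the exercise.

    newton_raphson <- function(f, fprime, x0, tol = 1e-8, max_iter = 100) {
      x <- x0
      for (i in 1:max_iter) {
        h <- -f(x) / fprime(x)   # step suggested by the tangent line approximation
        x <- x + h
        if (abs(h) < tol) break  # stop once the update becomes negligible
      }
      x
    }

    ## x^3 - 2x - 5 = 0, starting from x0 = 2
    newton_raphson(function(x) x^3 - 2 * x - 5, function(x) 3 * x^2 - 2, x0 = 2)

    ## cos(x) = x^3, i.e., cos(x) - x^3 = 0, starting from x0 = 0.5
    newton_raphson(function(x) cos(x) - x^3, function(x) -sin(x) - 3 * x^2, x0 = 0.5)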

  5. Let \(\{f(x; \theta)\colon x \in \mathbb{R}, \theta \in \Theta\}\) be a regular parametric statistical model. Given a sample \(X_1, \ldots, X_n\), write down the score equation. Write a pseudo code for the Newton-Raphson algorithm in this context. What is \(-f'\) in this situation?
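
    For reference (this is essentially what you should recover), writing the log-likelihood as \(\ell(\theta) = \sum_{i=1}^n \log f(X_i; \theta)\), the score equation and the corresponding Newton-Raphson update read

    \[ \ell'(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(X_i; \theta) = 0, \qquad \theta_{k+1} = \theta_k - \frac{\ell'(\theta_k)}{\ell''(\theta_k)}. \]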

  6. Application. Assume that our statistical model is the Poisson model and that we observed

    \[ X_1 = 1, X_2 = 5, X_3 = 5, X_4 = 5, X_5 = 4, X_6 = 6, X_7 = 2, X_8 = 4, X_9 = 2, X_{10} = 5. \]

    Give a numerical approximation of the maximum likelihood estimator and compare it to its closed form solution.
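
    A hedged sketch of the numerical part (the variable names are mine); recall that the closed-form maximum likelihood estimator of a Poisson mean is the sample mean, which you can use as a check.

    x <- c(1, 5, 5, 5, 4, 6, 2, 4, 2, 5)
    n <- length(x)
    score <- function(lambda) sum(x) / lambda - n        # derivative of the log-likelihood
    score_deriv <- function(lambda) -sum(x) / lambda^2   # second derivative

    lambda <- 1  # initial guess
    for (i in 1:100) {
      h <- -score(lambda) / score_deriv(lambda)
      lambda <- lambda + h
      if (abs(h) < 1e-8) break
    }
    c(newton = lambda, closed_form = mean(x))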

  7. Try another implementation where you substitute \(f'\) with its expected value. What is the name of this algorithm?
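
    A possible variant, reusing x, n and score from the previous sketch: replace the second derivative by its expected value, which for the Poisson model is \(-n/\lambda\) since \(\mbox{E}(\sum_i X_i) = n\lambda\).

    lambda <- 1
    for (i in 1:100) {
      h <- -score(lambda) / (-n / lambda)  # expected second derivative instead of the observed one
      lambda <- lambda + h
      if (abs(h) < 1e-8) break
    }
    lambda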

šŸ“ The glm function

The objective of this exercise is to get you acquainted with the glm function. Of course, you should by now be familiar with the theory of GLMs. If not, reread the lecture notes and/or ask questions!

  1. Read the documentation of the glm function. Pay particular attention to the family argument. You probably have to look at the family function too.
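
    As a purely illustrative snippet, a family object bundles the link function, its inverse and the variance function:

    fam <- poisson()        # the default link of the Poisson family is "log"
    fam$link                # "log"
    fam$linkfun(2)          # log(2)
    fam$linkinv(log(2))     # back to 2
    fam$variance(2)         # variance function evaluated at mu = 2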

  2. Run the following piece of code and describe the model that is fitted. Which link function is used?

    ## Example from ?glm (Dobson, 1990): counts cross-classified by outcome and treatment
    counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
    outcome <- gl(3, 1, 9)   # factor with 3 levels repeated one by one, total length 9
    treatment <- gl(3, 3)    # factor with 3 levels, each repeated 3 times
    df <- data.frame(treatment, outcome, counts)
    fit <- glm(counts ~ outcome + treatment, family = poisson, data = df)

  3. Comment on the output of the following code.

    summary(fit)
    
    Call:
    glm(formula = counts ~ outcome + treatment, family = poisson, 
        data = df)
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
    outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
    outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
    treatment2   1.398e-16  2.000e-01   0.000   1.0000    
    treatment3  -2.416e-16  2.000e-01   0.000   1.0000    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 10.5814  on 8  degrees of freedom
    Residual deviance:  5.1291  on 4  degrees of freedom
    AIC: 56.761
    
    Number of Fisher Scoring iterations: 4

  4. Is the treatment variable statistically significant? What about outcome?

  5. Read the documentation of the predict (well actually predict.glm) function. Make sure you understand all the different types of prediction, i.e., when you set type to either link, response or terms.
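
    As a small hedged illustration, reusing the fit object from the previous questions:

    eta <- predict(fit, type = "link")       # linear predictor scale
    mu  <- predict(fit, type = "response")   # mean scale, i.e., the inverse link applied to eta
    all.equal(unname(mu), unname(exp(eta)))  # TRUE here since the link is the log
    predict(fit, type = "terms")             # centred contribution of each term to eta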

šŸ“ Clotting times

For this exercise we are interested in modelling clotting times of blood for normal plasma diluted to nine percentage concentrations with prothrombin-free plasma; clotting was induced by two lots of thromboplastin.

  1. Read Section 8.4.2 of McCullagh and Nelder’s book, which discusses modelling clotting times of blood, and make sure you understand it!

  2. Do the initial plot mentioned in the text and comment on it.

  3. Using R, fit the models described on page 301. Are your results in agreement with what the book says?
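
    A possible starting point, assuming you have typed the data from the book into a data frame named clotting with columns u (plasma concentration, in %), lot1 and lot2 (clotting times, in seconds); these names are my own choice. Section 8.4.2 uses a Gamma GLM with log-concentration as covariate, and the default link of the Gamma family in R is the inverse link.

    fit1 <- glm(lot1 ~ log(u), data = clotting, family = Gamma)
    fit2 <- glm(lot2 ~ log(u), data = clotting, family = Gamma)
    summary(fit1)
    summary(fit2)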

  4. The text says ā€œClearly all the terms are necessaryā€. How would you check that this statement is correct?

  5. Do the plots described in the paragraph starting with ā€œThe plot of the Pearson residualsā€¦ā€

  6. Do you understand why the authors state that for a GLM with the Gamma family we have \(\mbox{Var}(Y) \propto \mu^2\) while when fitting a usual linear model on \(1/Y\) we have \(\mbox{Var}(Y) \propto \mu^4\)?
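
    As a hint, the delta method is the key: if the linear model on \(1/Y\) assumes a constant variance \(\sigma^2\), then

    \[ \sigma^2 = \mbox{Var}(1/Y) \approx \left(\frac{\mathrm{d}}{\mathrm{d}\mu} \frac{1}{\mu}\right)^2 \mbox{Var}(Y) = \frac{\mbox{Var}(Y)}{\mu^4}, \]

    so that this model implicitly assumes \(\mbox{Var}(Y) \propto \mu^4\).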

šŸ“ Oesophageal cancer in Ille-et-Vilaine

For this exercise we will analyse the esoph dataset, which ships with R.

  1. Familiarize yourself with the dataset.

  2. Run the following piece of code and explain what model is fitted here. Why does the left-hand side of the R formula have two entries, ncases and ncontrols?

    fit <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp, data = esoph, family = binomial())

    Give comments here.

  3. Perform a residual analysis (a minimal starting point is sketched below).

    ## Insert code here

    Give comments here.
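
    If you need a nudge, a minimal starting point (the choice of residual type and plots is mine) could be:

    plot(predict(fit, type = "link"), residuals(fit, type = "pearson"),
         xlab = "Linear predictor", ylab = "Pearson residuals")
    abline(h = 0, lty = 2)
    plot(fit)  # the usual diagnostic plots are also available for glm objects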

  4. Interpret results.

    ## Insert code here

    Give comments here.

  5. Fit a quasibinomial GLM and spot any differences between this new model and the previous one (a possible starting point is sketched below).

    ## Insert code here

    Give comments here.
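
    Again, only a hedged sketch to get you started:

    fit_quasi <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
                     data = esoph, family = quasibinomial())
    summary(fit_quasi)  # same point estimates as before, but the dispersion is now estimated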