Trying to be rich

We will work on the EuStockMarkets dataset, which contains four financial indices for various European countries. Our goal is to predict whether the CAC40 will increase the next day.

Data pre-processing

Let’s have a look at the data.

plot(EuStockMarkets)

As expected, all indices increase over time and there seem to be no anomalies in the data. To get rich we need to predict whether the CAC40 will increase the next day, so we will work with daily increments, using yesterday's increments of every index as features.

inc_cac <- diff(EuStockMarkets[, "CAC"])  ## increments for the CAC40

We need to do that for each index, and coding it column by column as above is not practical. It's better to do it this way:

increments <- apply(EuStockMarkets, 2, diff)  ## MARGIN = 2 applies diff to each column

For real-life modelling we cannot use future observations, so we build the following dataset:

nobs <- nrow(increments)
## outcome: does the CAC40 increase on day i + 1?
outcome <- factor(ifelse(increments[-1, "CAC"] > 0, "rich", "not rich"))
## features: the increments of all four indices on day i
df <- data.frame(outcome = outcome, increments[-nobs, ])
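
As a quick sanity check that we are not peeking into the future (a minimal sketch, with column names of my own choosing), we can put today's increment next to tomorrow's outcome:

head(data.frame(today_inc_CAC = increments[-nobs, "CAC"],
                CAC_up_tomorrow = outcome))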

Now we can have a look at the data.

plot(df[,-1], col = df$outcome)

Clearly, and as expected, the variables are strongly correlated. This is unfortunate: although we have 4 features, they do not carry that much independent information.
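
To quantify this, we can take a quick look at the correlation matrix of the increments (a small check, not part of the original analysis):

round(cor(df[, -1]), 2)  ## pairwise correlations of the four increment series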

Modelling

We have a binary outcome, so logistic regression is a sensible model. Let's try!

first_fit <- glm(outcome ~ ., data = df, family = binomial)
summary(first_fit)
## 
## Call:
## glm(formula = outcome ~ ., family = binomial, data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0281765  0.0466056  -0.605    0.545
## DAX          0.0005787  0.0025992   0.223    0.824
## SMI         -0.0026811  0.0018179  -1.475    0.140
## CAC          0.0042882  0.0028538   1.503    0.133
## FTSE        -0.0011958  0.0022297  -0.536    0.592
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2575.3  on 1857  degrees of freedom
## Residual deviance: 2571.2  on 1853  degrees of freedom
## AIC: 2581.2
## 
## Number of Fisher Scoring iterations: 3

The model is very poor: not a single statistically significant variable. Maybe we can make it more complex by adding pairwise interactions (note that in an R formula, .^2 expands to all main effects plus two-way interactions; it does not add squared terms, which would require I(DAX^2) and so on).
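
To see the expansion explicitly (a quick check I am adding, not part of the original analysis), we can list the columns of the design matrix:

colnames(model.matrix(outcome ~ .^2, data = df))  ## main effects + two-way interactions, no squares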

fit_second <- glm(outcome ~ .^2, data = df, family = binomial)
summary(fit_second)
## 
## Call:
## glm(formula = outcome ~ .^2, family = binomial, data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -9.284e-03  5.011e-02  -0.185   0.8530  
## DAX          1.490e-04  2.626e-03   0.057   0.9547  
## SMI         -2.285e-03  1.860e-03  -1.229   0.2192  
## CAC          5.686e-03  2.946e-03   1.930   0.0536 .
## FTSE        -2.134e-03  2.288e-03  -0.933   0.3508  
## DAX:SMI      1.139e-05  4.447e-05   0.256   0.7979  
## DAX:CAC      1.755e-06  9.148e-05   0.019   0.9847  
## DAX:FTSE     4.297e-05  8.005e-05   0.537   0.5914  
## SMI:CAC     -9.556e-05  8.006e-05  -1.194   0.2327  
## SMI:FTSE     4.631e-05  5.737e-05   0.807   0.4196  
## CAC:FTSE    -5.902e-05  8.869e-05  -0.665   0.5057  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2575.3  on 1857  degrees of freedom
## Residual deviance: 2563.9  on 1847  degrees of freedom
## AIC: 2585.9
## 
## Number of Fisher Scoring iterations: 4

Not much of an improvement. Let's also use the signs of the increments as features.

pos_or_neg <- function(x)
  ifelse(x > 0, "positive", "negative")

new_feat <- apply(df[, -1], 2, pos_or_neg)  ## sign ("positive"/"negative") of each increment
colnames(new_feat) <- paste(colnames(new_feat), "sign", sep = "_")
df <- data.frame(df, new_feat)

Last attempt at feature engineering:

fit <- glm(outcome ~ .^2, data = df, family = binomial)
summary(fit)
## 
## Call:
## glm(formula = outcome ~ .^2, family = binomial, data = df)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                         2.041e-01  1.718e-01   1.188  0.23492   
## DAX                                 7.573e-03  8.007e-03   0.946  0.34425   
## SMI                                -5.238e-03  5.906e-03  -0.887  0.37519   
## CAC                                 2.042e-03  8.135e-03   0.251  0.80182   
## FTSE                                1.219e-03  7.247e-03   0.168  0.86637   
## DAX_signpositive                   -1.027e+00  3.446e-01  -2.979  0.00289 **
## SMI_signpositive                   -2.174e-01  2.998e-01  -0.725  0.46835   
## CAC_signpositive                   -3.081e-01  3.511e-01  -0.878  0.38020   
## FTSE_signpositive                  -3.103e-01  3.263e-01  -0.951  0.34171   
## DAX:SMI                            -1.357e-05  6.382e-05  -0.213  0.83165   
## DAX:CAC                             1.714e-04  1.542e-04   1.112  0.26607   
## DAX:FTSE                            4.445e-05  1.168e-04   0.380  0.70358   
## DAX:DAX_signpositive                2.623e-02  1.206e-02   2.175  0.02961 * 
## DAX:SMI_signpositive               -1.803e-03  9.265e-03  -0.195  0.84571   
## DAX:CAC_signpositive               -2.885e-02  1.202e-02  -2.400  0.01640 * 
## DAX:FTSE_signpositive              -1.039e-02  1.097e-02  -0.948  0.34332   
## SMI:CAC                            -1.892e-04  1.244e-04  -1.521  0.12836   
## SMI:FTSE                            1.089e-04  8.936e-05   1.218  0.22311   
## SMI:DAX_signpositive                3.575e-04  7.632e-03   0.047  0.96263   
## SMI:SMI_signpositive                1.398e-03  7.897e-03   0.177  0.85948   
## SMI:CAC_signpositive                1.866e-03  7.662e-03   0.244  0.80761   
## SMI:FTSE_signpositive               1.830e-03  7.048e-03   0.260  0.79510   
## CAC:FTSE                           -1.718e-04  1.535e-04  -1.120  0.26289   
## CAC:DAX_signpositive               -1.428e-02  1.213e-02  -1.177  0.23909   
## CAC:SMI_signpositive                2.645e-03  1.131e-02   0.234  0.81513   
## CAC:CAC_signpositive                1.559e-03  1.223e-02   0.127  0.89859   
## CAC:FTSE_signpositive               2.406e-02  1.229e-02   1.958  0.05018 . 
## FTSE:DAX_signpositive              -4.054e-03  1.001e-02  -0.405  0.68558   
## FTSE:SMI_signpositive              -9.000e-03  9.250e-03  -0.973  0.33057   
## FTSE:CAC_signpositive              -1.051e-02  9.635e-03  -1.091  0.27532   
## FTSE:FTSE_signpositive              1.701e-02  8.694e-03   1.956  0.05044 . 
## DAX_signpositive:SMI_signpositive   3.762e-01  3.607e-01   1.043  0.29699   
## DAX_signpositive:CAC_signpositive   8.340e-01  4.229e-01   1.972  0.04859 * 
## DAX_signpositive:FTSE_signpositive  6.331e-01  4.153e-01   1.525  0.12737   
## SMI_signpositive:CAC_signpositive  -1.767e-02  4.101e-01  -0.043  0.96563   
## SMI_signpositive:FTSE_signpositive  9.293e-02  3.967e-01   0.234  0.81477   
## CAC_signpositive:FTSE_signpositive -3.306e-01  4.353e-01  -0.759  0.44756   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2575.3  on 1857  degrees of freedom
## Residual deviance: 2530.7  on 1821  degrees of freedom
## AIC: 2604.7
## 
## Number of Fisher Scoring iterations: 4

OK, not so bad. Now we do model selection with the stepAIC function from the MASS package, which performs stepwise selection by minimizing the AIC (a bit crude, since I don't like automatic model selection, but this is live coding, so…).

library(MASS)
fit <- stepAIC(fit, trace = 0)
summary(fit)
## 
## Call:
## glm(formula = outcome ~ DAX + SMI + CAC + FTSE + DAX_sign + CAC_sign + 
##     FTSE_sign + DAX:DAX_sign + DAX:CAC_sign + SMI:CAC + SMI:FTSE + 
##     CAC:FTSE_sign + FTSE:CAC_sign + FTSE:FTSE_sign + DAX_sign:CAC_sign, 
##     family = binomial, data = df)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        5.099e-02  1.230e-01   0.415  0.67840   
## DAX                               -3.319e-04  4.389e-03  -0.076  0.93972   
## SMI                               -2.473e-03  1.891e-03  -1.307  0.19108   
## CAC                                1.569e-03  5.051e-03   0.311  0.75610   
## FTSE                               1.261e-03  4.931e-03   0.256  0.79821   
## DAX_signpositive                  -4.239e-01  1.971e-01  -2.151  0.03147 * 
## CAC_signpositive                  -3.729e-01  2.135e-01  -1.747  0.08070 . 
## FTSE_signpositive                 -1.190e-01  1.417e-01  -0.840  0.40095   
## DAX:DAX_signpositive               2.730e-02  1.015e-02   2.690  0.00715 **
## DAX:CAC_signpositive              -2.682e-02  1.011e-02  -2.653  0.00797 **
## SMI:CAC                           -1.205e-04  5.011e-05  -2.404  0.01620 * 
## SMI:FTSE                           6.705e-05  4.343e-05   1.544  0.12257   
## CAC:FTSE_signpositive              1.471e-02  6.779e-03   2.170  0.03000 * 
## FTSE:CAC_signpositive             -1.723e-02  6.464e-03  -2.666  0.00768 **
## FTSE:FTSE_signpositive             1.093e-02  7.433e-03   1.470  0.14153   
## DAX_signpositive:CAC_signpositive  6.454e-01  3.046e-01   2.119  0.03413 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2575.3  on 1857  degrees of freedom
## Residual deviance: 2540.1  on 1842  degrees of freedom
## AIC: 2572.1
## 
## Number of Fisher Scoring iterations: 4

Analyzing the odds

Although analyzing the odds is not mandatory for this application (we want to earn money but, even better, to learn statistical analysis), I will interpret only the DAX_signpositive coefficient. Since \(\exp(-0.4239) \approx 0.65\), a positive DAX increment, as opposed to a negative one, decreases the odds of the CAC40 going up by about \(35\%\), all other terms held fixed.
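
To get this odds ratio directly from the fitted model (a small sketch I am adding; confint profiles the likelihood, which takes a moment):

exp(coef(fit)["DAX_signpositive"])      ## odds ratio, about 0.65
exp(confint(fit, "DAX_signpositive"))   ## profile-likelihood confidence interval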

Model performance: Will we be rich?

Let's assess the performance of our binary classifier with the caret package, which we first need to install

install.packages("caret")

and load

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## classify with a 0.5 threshold on the fitted (in-sample) probabilities
pred <- factor(ifelse(predict(fit, type = "response") > 0.5, "rich", "not rich"))
confusionMatrix(pred, df$outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not rich rich
##   not rich      556  457
##   rich          388  457
##                                          
##                Accuracy : 0.5452         
##                  95% CI : (0.5222, 0.568)
##     No Information Rate : 0.5081         
##     P-Value [Acc > NIR] : 0.0007334      
##                                          
##                   Kappa : 0.0891         
##                                          
##  Mcnemar's Test P-Value : 0.0193214      
##                                          
##             Sensitivity : 0.5890         
##             Specificity : 0.5000         
##          Pos Pred Value : 0.5489         
##          Neg Pred Value : 0.5408         
##              Prevalence : 0.5081         
##          Detection Rate : 0.2992         
##    Detection Prevalence : 0.5452         
##       Balanced Accuracy : 0.5445         
##                                          
##        'Positive' Class : not rich       
## 

Now, to get the ROC curve from the pROC package, we install it

install.packages("pROC")

and load it as before

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

We are now all set to plot the ROC curve

test_prob <- predict(fit, newdata = df[, -1], type = "response")  ## in-sample probabilities
test_roc <- roc(df$outcome ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = not rich, case = rich
## Setting direction: controls < cases

We can see that our logistic classifier is only slightly better than a coin flip, and since this is an in-sample assessment, reality is likely even less kind. Unfortunately we won't be millionaires :-(
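
For an honest estimate, we would refit on the early part of the sample and evaluate on the later part. A minimal sketch (the chronological 80/20 split and the object names are my own choices, not part of the original analysis):

n_train <- floor(0.8 * nrow(df))
df_train <- df[seq_len(n_train), ]
df_test <- df[-seq_len(n_train), ]
## note: formula(fit) was selected on the full sample, so a bit of optimism remains
fit_oos <- glm(formula(fit), data = df_train, family = binomial)
prob_oos <- predict(fit_oos, newdata = df_test, type = "response")
roc(df_test$outcome ~ prob_oos, plot = TRUE, print.auc = TRUE)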