Trying to be rich
We will work on the EuStockMarkets dataset which has 4 financial indexes for various European countries. Our goal is to predict if the CAC40 will increase next day.
Data pre-processing
Let’s have a look at the data.
As expected we can see that all indexed increase with time and there seems to have no anomalies in our data. To be rich we need to predict if the CAC40 will increase next day. Hence we will work with increments and values of yesterday for the other indexes.
We need to do that for each index and the way we code it above is not practical. It’s better to do this way
increments <- apply(EuStockMarkets, 2, diff)##2 means that we apply the function diff on the columns of EuStockMarkets
For real life modelling, we cannot use future observations so we have to work with the following dataset
nobs <- nrow(increments)
outcome <- factor(ifelse(increments[-1,"CAC"] > 0, "rich", "not rich"))
df <- data.frame(outcome = outcome, increments[-nobs,])
Now we can have a look at the data.
Clearly and as expected, variables are (strongly) correlated. This is unfortunate. Why? Because it means that although we have 4 features, we don’t have so many information.
Modelling
We have a binary outcome so the use of the logistic regression model is sensible. Let’s try!
##
## Call:
## glm(formula = outcome ~ ., family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0281765 0.0466056 -0.605 0.545
## DAX 0.0005787 0.0025992 0.223 0.824
## SMI -0.0026811 0.0018179 -1.475 0.140
## CAC 0.0042882 0.0028538 1.503 0.133
## FTSE -0.0011958 0.0022297 -0.536 0.592
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2575.3 on 1857 degrees of freedom
## Residual deviance: 2571.2 on 1853 degrees of freedom
## AIC: 2581.2
##
## Number of Fisher Scoring iterations: 3
The model is very poor, no statistically significant variable. Maybe we can complexify our model using squared values
##
## Call:
## glm(formula = outcome ~ .^2, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.284e-03 5.011e-02 -0.185 0.8530
## DAX 1.490e-04 2.626e-03 0.057 0.9547
## SMI -2.285e-03 1.860e-03 -1.229 0.2192
## CAC 5.686e-03 2.946e-03 1.930 0.0536 .
## FTSE -2.134e-03 2.288e-03 -0.933 0.3508
## DAX:SMI 1.139e-05 4.447e-05 0.256 0.7979
## DAX:CAC 1.755e-06 9.148e-05 0.019 0.9847
## DAX:FTSE 4.297e-05 8.005e-05 0.537 0.5914
## SMI:CAC -9.556e-05 8.006e-05 -1.194 0.2327
## SMI:FTSE 4.631e-05 5.737e-05 0.807 0.4196
## CAC:FTSE -5.902e-05 8.869e-05 -0.665 0.5057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2575.3 on 1857 degrees of freedom
## Residual deviance: 2563.9 on 1847 degrees of freedom
## AIC: 2585.9
##
## Number of Fisher Scoring iterations: 4
Not a good improvement. Let’s work with the sign of the increments as well.
pos_or_neg <- function(x)
ifelse(x > 0, "positive", "negative")
new_feat <- apply(df[,-1], 2, pos_or_neg)
colnames(new_feat) <- paste(colnames(new_feat), "sign", sep = "_")
df <- data.frame(df, new_feat)
Last attempt for feature engineering.
##
## Call:
## glm(formula = outcome ~ .^2, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.041e-01 1.718e-01 1.188 0.23492
## DAX 7.573e-03 8.007e-03 0.946 0.34425
## SMI -5.238e-03 5.906e-03 -0.887 0.37519
## CAC 2.042e-03 8.135e-03 0.251 0.80182
## FTSE 1.219e-03 7.247e-03 0.168 0.86637
## DAX_signpositive -1.027e+00 3.446e-01 -2.979 0.00289 **
## SMI_signpositive -2.174e-01 2.998e-01 -0.725 0.46835
## CAC_signpositive -3.081e-01 3.511e-01 -0.878 0.38020
## FTSE_signpositive -3.103e-01 3.263e-01 -0.951 0.34171
## DAX:SMI -1.357e-05 6.382e-05 -0.213 0.83165
## DAX:CAC 1.714e-04 1.542e-04 1.112 0.26607
## DAX:FTSE 4.445e-05 1.168e-04 0.380 0.70358
## DAX:DAX_signpositive 2.623e-02 1.206e-02 2.175 0.02961 *
## DAX:SMI_signpositive -1.803e-03 9.265e-03 -0.195 0.84571
## DAX:CAC_signpositive -2.885e-02 1.202e-02 -2.400 0.01640 *
## DAX:FTSE_signpositive -1.039e-02 1.097e-02 -0.948 0.34332
## SMI:CAC -1.892e-04 1.244e-04 -1.521 0.12836
## SMI:FTSE 1.089e-04 8.936e-05 1.218 0.22311
## SMI:DAX_signpositive 3.575e-04 7.632e-03 0.047 0.96263
## SMI:SMI_signpositive 1.398e-03 7.897e-03 0.177 0.85948
## SMI:CAC_signpositive 1.866e-03 7.662e-03 0.244 0.80761
## SMI:FTSE_signpositive 1.830e-03 7.048e-03 0.260 0.79510
## CAC:FTSE -1.718e-04 1.535e-04 -1.120 0.26289
## CAC:DAX_signpositive -1.428e-02 1.213e-02 -1.177 0.23909
## CAC:SMI_signpositive 2.645e-03 1.131e-02 0.234 0.81513
## CAC:CAC_signpositive 1.559e-03 1.223e-02 0.127 0.89859
## CAC:FTSE_signpositive 2.406e-02 1.229e-02 1.958 0.05018 .
## FTSE:DAX_signpositive -4.054e-03 1.001e-02 -0.405 0.68558
## FTSE:SMI_signpositive -9.000e-03 9.250e-03 -0.973 0.33057
## FTSE:CAC_signpositive -1.051e-02 9.635e-03 -1.091 0.27532
## FTSE:FTSE_signpositive 1.701e-02 8.694e-03 1.956 0.05044 .
## DAX_signpositive:SMI_signpositive 3.762e-01 3.607e-01 1.043 0.29699
## DAX_signpositive:CAC_signpositive 8.340e-01 4.229e-01 1.972 0.04859 *
## DAX_signpositive:FTSE_signpositive 6.331e-01 4.153e-01 1.525 0.12737
## SMI_signpositive:CAC_signpositive -1.767e-02 4.101e-01 -0.043 0.96563
## SMI_signpositive:FTSE_signpositive 9.293e-02 3.967e-01 0.234 0.81477
## CAC_signpositive:FTSE_signpositive -3.306e-01 4.353e-01 -0.759 0.44756
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2575.3 on 1857 degrees of freedom
## Residual deviance: 2530.7 on 1821 degrees of freedom
## AIC: 2604.7
##
## Number of Fisher Scoring iterations: 4
Ok not so bad. Now we do model selection using the stepAIC function (a bit harsh since I don’t like automatic model selection but this is live coding so…)
##
## Call:
## glm(formula = outcome ~ DAX + SMI + CAC + FTSE + DAX_sign + CAC_sign +
## FTSE_sign + DAX:DAX_sign + DAX:CAC_sign + SMI:CAC + SMI:FTSE +
## CAC:FTSE_sign + FTSE:CAC_sign + FTSE:FTSE_sign + DAX_sign:CAC_sign,
## family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.099e-02 1.230e-01 0.415 0.67840
## DAX -3.319e-04 4.389e-03 -0.076 0.93972
## SMI -2.473e-03 1.891e-03 -1.307 0.19108
## CAC 1.569e-03 5.051e-03 0.311 0.75610
## FTSE 1.261e-03 4.931e-03 0.256 0.79821
## DAX_signpositive -4.239e-01 1.971e-01 -2.151 0.03147 *
## CAC_signpositive -3.729e-01 2.135e-01 -1.747 0.08070 .
## FTSE_signpositive -1.190e-01 1.417e-01 -0.840 0.40095
## DAX:DAX_signpositive 2.730e-02 1.015e-02 2.690 0.00715 **
## DAX:CAC_signpositive -2.682e-02 1.011e-02 -2.653 0.00797 **
## SMI:CAC -1.205e-04 5.011e-05 -2.404 0.01620 *
## SMI:FTSE 6.705e-05 4.343e-05 1.544 0.12257
## CAC:FTSE_signpositive 1.471e-02 6.779e-03 2.170 0.03000 *
## FTSE:CAC_signpositive -1.723e-02 6.464e-03 -2.666 0.00768 **
## FTSE:FTSE_signpositive 1.093e-02 7.433e-03 1.470 0.14153
## DAX_signpositive:CAC_signpositive 6.454e-01 3.046e-01 2.119 0.03413 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2575.3 on 1857 degrees of freedom
## Residual deviance: 2540.1 on 1842 degrees of freedom
## AIC: 2572.1
##
## Number of Fisher Scoring iterations: 4
Analyzing the odds
Although it is not mandatory for this application to analyze the odds, we want to earn mony but, better, learn a statistical analysis. I will do it only intrepret for variable DAX_signpositive. As \(\exp(-4.239e-01) \approx 0.65\), having a positive increment for DAX as opposed to a negative one, decrease the odds by amount of \(35%\).
Model performance: Will we be rich?
Trying to assess the performance of our binary classifier using the caret package that we need to install first
and load
## Loading required package: ggplot2
## Loading required package: lattice
pred <- factor(ifelse(predict(fit, type = "response") > 0.5, "rich", "not rich"))
confusionMatrix(pred, df$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction not rich rich
## not rich 556 457
## rich 388 457
##
## Accuracy : 0.5452
## 95% CI : (0.5222, 0.568)
## No Information Rate : 0.5081
## P-Value [Acc > NIR] : 0.0007334
##
## Kappa : 0.0891
##
## Mcnemar's Test P-Value : 0.0193214
##
## Sensitivity : 0.5890
## Specificity : 0.5000
## Pos Pred Value : 0.5489
## Neg Pred Value : 0.5408
## Prevalence : 0.5081
## Detection Rate : 0.2992
## Detection Prevalence : 0.5452
## Balanced Accuracy : 0.5445
##
## 'Positive' Class : not rich
##
Now to get the ROC curve from the pROC package we install it
adn load it as before
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
We are now all set to plot the ROC curve
test_prob = predict(fit, newdata = df[,-1], type = "response")
test_roc = roc(df$outcome ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = not rich, case = rich
## Setting direction: controls < cases
We can see that our logistic classifier is slightly better than the random coin classifier. Unfortunately we won’t be millionaires :-(