Geostatistics with the R software
Before we start
Before we start: I will use colors suited for colorblind people, taken from the RColorBrewer package. I suggest you do the same in your own work; it is a small extra effort for us but it means a lot to some of us!
Introduction
For this lab we are going to model the spatial distribution of temperatures over France. To do so we will use the geoR package, which you may need to install first (do it once only!). Once installed,
you are all set to load the brand new library
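The install/load commands were elided in the handout; they presumably look like the following sketch (install once, load at every session):

```r
## Install once, then comment this line out
# install.packages("geoR")
## Load the library; this prints the startup banner below
library(geoR)
```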
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.9-2 (built on 2022-08-09) is now loaded
## --------------------------------------------------------------
The data consist of two separate files:
- the list of the weather stations
- time series for all weather stations
To retrieve the list of the weather stations you can invoke the following lines
url_stations <- "https://donneespubliques.meteofrance.fr/donnees_libres/Txt/Synop/postesSynop.csv"
desfile <- "./stations.csv"
if (!file.exists(desfile))
    download.file(url_stations, destfile = desfile)
## Read the csv file
stations <- read.csv2(desfile, dec = ".")
## Keep only mainland stations
## all my apologies to Corsica and the overseas territories ("territoires ultra-marins")
stations <- stations[1:40,]
head(stations)
## ID Nom Latitude Longitude Altitude
## 1 7005 ABBEVILLE 50.13600 1.834000 69
## 2 7015 LILLE-LESQUIN 50.57000 3.097500 47
## 3 7020 PTE DE LA HAGUE 49.72517 -1.939833 6
## 4 7027 CAEN-CARPIQUET 49.18000 -0.456167 67
## 5 7037 ROUEN-BOOS 49.38300 1.181667 151
## 6 7072 REIMS-PRUNAY 49.20967 4.155333 95
To plot the locations of these weather stations we need an additional mapping library.
Update: the ggmap package now requires a Google API key to plot maps. I don’t want to force you to get one, so we will use a different approach instead, though it is less user-friendly and fancy.
library(geodata)## provides gadm(); loads terra
## Loading required package: terra
## terra 1.7.55
france <- gadm(country = "FRA", level = 0, path = tempdir())
plot(france)
points(stations$Longitude, stations$Latitude, pch = 15, col = my_colors[1])
Time series for each station and a given month are available from this link.
For this lab I did the work for you and chose October 2023 (why? I don’t know… maybe because it is November 2023…). To retrieve the time series, just invoke
url <- "https://mribatet.perso.math.cnrs.fr/CentraleNantes/Data/synop.202310.csv"
## To handle the date correctly it will be easier for us to
## treat it as a character string
colClasses <- rep("numeric", 60) ## there are 60 variables
colClasses[2] <- "character"
data <- read.csv2(url, na.strings = "mq", dec = ".",
colClasses = colClasses)
## Previously we selected only mainland stations
## we need to do the same here
data <- subset(data, data$numer_sta %in% stations$ID)
head(data)
## numer_sta date pmer tend cod_tend dd ff t td u vv
## 1 7005 20231001000000 102460 -40 8 150 2.1 286.85 285.05 89 3350
## 2 7015 20231001000000 102470 -50 8 160 1.8 285.15 283.55 90 19740
## 3 7020 20231001000000 102270 -30 6 200 4.8 291.35 289.65 90 16000
## 4 7027 20231001000000 102400 -40 8 140 1.3 286.15 285.35 95 17240
## 5 7037 20231001000000 102500 -20 8 150 2.3 286.15 285.05 93 19430
## 6 7072 20231001000000 102610 -30 6 100 1.4 283.35 282.45 94 18660
## ww w1 w2 n nbas hbas cl cm ch pres niv_bar geop tend24 tn12 tn24 tx12 tx24
## 1 10 NA NA NA NA NA NA NA NA 101590 NA NA -190 NA NA NA NA
## 2 0 NA NA NA 0 NA NA NA NA 101900 NA NA -130 NA NA NA NA
## 3 1 0 NA 25 0 7000 30 20 15 102160 NA NA -380 NA NA NA NA
## 4 0 NA NA NA 0 NA NA NA NA 101590 NA NA -250 NA NA NA NA
## 5 0 NA NA NA 0 NA NA NA NA 100610 NA NA -220 NA NA NA NA
## 6 0 NA NA NA 0 NA NA NA NA 101450 NA NA -90 NA NA NA NA
## tminsol sw tw raf10 rafper per etat_sol ht_neige ssfrai perssfrai rr1 rr3
## 1 285.85 NA NA 3.9 4.4 -10 0 0 NA NA 0 -0.1
## 2 285.95 NA NA 3.4 3.4 -10 0 0 NA NA 0 0.0
## 3 NA NA NA 6.8 8.0 -10 NA NA NA NA 0 0.0
## 4 288.45 NA NA 2.4 3.8 -10 0 0 NA NA 0 0.0
## 5 285.45 NA NA 2.7 3.4 -10 NA 0 NA NA 0 0.0
## 6 282.55 NA NA 2.3 3.1 -10 0 0 NA NA 0 0.0
## rr6 rr12 rr24 phenspe1 phenspe2 phenspe3 phenspe4 nnuage1 ctype1 hnuage1
## 1 -0.1 -0.1 0.2 NA NA NA NA NA NA NA
## 2 0.0 0.0 0.2 NA NA NA NA NA NA NA
## 3 0.0 0.0 0.0 NA NA NA NA 2 0 7000
## 4 0.0 0.0 0.2 NA NA NA NA NA NA NA
## 5 0.0 0.0 0.2 NA NA NA NA NA NA NA
## 6 0.0 0.0 0.2 NA NA NA NA NA NA NA
## nnuage2 ctype2 hnuage2 nnuage3 ctype3 hnuage3 nnuage4 ctype4 hnuage4 X
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA
Data preprocessing
From the above output, you can see that there are plenty of variables in our dataset but, as stated previously, we will work on temperature only, i.e., the column named ‘t’. We will thus keep only the following columns:
- ‘numer_sta’: the unique identifier of the station
- ‘date’: the date of record (format YYYYMMDDHHMMSS)
- ‘t’: the temperature in Kelvin
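The handout elides the actual subsetting; a minimal sketch, assuming `data` was read as shown above, is:

```r
## Keep only the station identifier, the date and the temperature
data <- data[, c("numer_sta", "date", "t")]
```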
The date has an uncommon format and we need the software to take care of that. R can deal with dates out of the box, but working with hours, minutes and seconds requires an additional library, namely chron.
We are now ready to treat the date as usual
library(chron)
##
## Attaching package: 'chron'
## The following objects are masked from 'package:terra':
##
## origin, origin<-
date <- substr(data$date, 1, 8) ## first 8 characters: YYYYMMDD
date <- as.Date(date, format = "%Y%m%d")
hour <- substr(data$date, 9, 10)
minute <- substr(data$date, 11, 12)
second <- substr(data$date, 13, 14)
tmp <- paste(hour, minute, second, sep = ":")
data$date <- as.chron(date, tmp)
Let’s have a look at the time series at a few given locations
## Focus on three specific stations
locations <- c("PERPIGNAN", "MONTPELLIER", "NICE")
selected_data <- NULL
for (location in locations){
## Get its unique identifier
location_id <- stations$ID[stations$Nom == location]
## Retrieve the corresponding time series
data_location <- subset(data, data$numer_sta == location_id)
selected_data <- cbind(selected_data, data_location$t)
}
## Warning in cbind(selected_data, data_location$t): number of rows of result is
## not a multiple of vector length (arg 2)
The warning occurs because the selected stations do not all have the same number of records, so cbind recycles the shorter columns; it is harmless for this quick plot but worth keeping in mind.
matplot(selected_data, type = "l", lty = 1, xlab = "Index",
        ylab = "Temperature (K)", col = my_colors[1:3])
legend("topleft", locations, bty = "n", lty = 1, col = my_colors[1:3],
       ncol = 3)
Though not invariably, geostatistics often deals with a single realization of the spatial process. Hence we will focus on a specific time, October 15th at 9AM (why? again, I don’t know), i.e.,
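The exact subsetting call was elided; a sketch, assuming `data$date` was converted to chron as above (the timestamp convention is an assumption):

```r
## Keep the records of October 15th, 2023 at 9AM only
target <- chron(dates. = "10/15/23", times. = "09:00:00")
data <- subset(data, date == target)
```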
The geoR package
The geoR package is very convenient for modelling spatial data. As far as I know there is no real equivalent in Python, where the available libraries are rather poor. Anyway, we won’t start a controversy between R and Python here, but always ask yourself what a given software brings you ;-)
When using the geoR package, it is convenient to use its base class geodata. Although there are many ways to do it, we will use the following code (make sure to understand what’s going on here)
## Create a data frame with all the variables
df <- data.frame(stations, temperature = data$t)
geodata <- as.geodata(df, coords.col = 4:3, ## 4:3 so coordinates are (lon, lat), not (lat, lon)
                      data.col = 6, covar.col = 5)
## as.geodata: 1 points removed due to NA in the data
Descriptive analysis
Using our freshly created geodata object, we can produce our first descriptive plot.
- Read the help file of the plot.geodata function.
- Run the following code and comment.
## Note that I use 'plot', not 'plot.geodata', as R dispatches on the class
plot(geodata, qt.col = my_colors[1:4])
- Plot the data w.r.t. elevation and comment.
- If needed re-read the documentation of the plot.geodata to remove any potential trend w.r.t. latitude.
- Read the documentation of the variog and variog4 functions and plot your first empirical variograms.
- Compare the raw (binned) empirical variogram and that where the potential trend w.r.t. latitude was removed. Comment.
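A possible starting point for the variogram questions above (a sketch assuming the geodata object and my_colors created earlier; trend = "1st" removes a first-order polynomial trend in both coordinates, which is a coarser choice than latitude only):

```r
## Raw binned empirical variogram
vario_raw <- variog(geodata)
plot(vario_raw, col = my_colors[1])
## Same, after removing a 1st order trend in the coordinates
vario_trend <- variog(geodata, trend = "1st")
lines(vario_trend$u, vario_trend$v, type = "b", col = my_colors[2])
```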
Fitting
- Play (a bit, don’t spend the rest of the lecture on it ;-)) with the eyefit function to learn how the variogram / covariance parameters impact the variogram behaviour.
- Fit an exponential variogram using the variofit function without any nugget effect.
- (Optional) Try different variogram families.
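A minimal fitting sketch, assuming the geodata object from above; the initial values for the partial sill and range are rough guesses (e.g. read off eyefit) that you should adapt:

```r
## Fit an exponential variogram without nugget effect
vario <- variog(geodata)
fit <- variofit(vario, ini.cov.pars = c(5, 5), ## guesses: partial sill, range (degrees)
                cov.model = "exponential",
                fix.nugget = TRUE, nugget = 0)
fit
```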
Model checking
- Read the documentation of the xvalid function and use it to assess if your model is accurate enough.
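For instance, a minimal cross-validation sketch, assuming `fit` is your fitted variogram model from the previous section:

```r
## Leave-one-out cross-validation of the fitted model
cv <- xvalid(geodata, model = fit)
summary(cv)
## plot.xvalid draws several diagnostic panels
plot(cv)
```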
- What could you possibly do to get better performance?
Prediction
Read the documentation of the krige.control and krige.conv functions.
Perform your first spatial interpolation on the following grid
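The grid itself was elided; here is a sketch, assuming `fit` from the fitting section, where the grid bounds (roughly covering mainland France) and resolution are my own choices:

```r
## A regular (lon, lat) grid covering (roughly) mainland France
grid <- expand.grid(lon = seq(-5, 9, length.out = 100),
                    lat = seq(41, 52, length.out = 100))
## Ordinary kriging settings taken from the fitted variogram model
kc <- krige.control(obj.model = fit)
pred <- krige.conv(geodata, locations = grid, krige = kc)
## Map of the kriging predictions
image(pred)
```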