Clipping netCDF file to a shapefile and cloning the metadata variables in R

I have NetCDF files (e.g. https://data.ceda.ac.uk/neodc/esacci/lakes/data/lake_products/L3S/v1.0/2019 global domain), and I want to extract the data within a shapefile boundary (in this case a lake: https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd) and then save the clipped data as a NetCDF file while retaining all the original metadata and variable names in the clipped file. This is what I have done so far:
library(rgdal)
library(sf)
library(ncdf4)
library(terra)
#Read in the shapefile of Lake
Lake_shape <- readOGR("C:/Users/CEDA/hydro_p_LakeA/hydro_p_A.shp")
# Reading the netcdf file using Terra Package function rast
test <- rast("ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190705-fv1.0.nc")
# List some of the variable names of the original dataset
head(names(test))
[1] "water_surface_height_above_reference_datum" "water_surface_height_uncertainty" "lake_surface_water_extent"
[4] "lake_surface_water_extent_uncertainty" "lake_surface_water_temperature" "lswt_uncertainty"
#Clipping data to smaller Lake domain using the crop function in Terra Package
test3 <- crop(test, Lake_shape)
#Listing some of the variable names of the clipped data
head(names(test3))
[1] "water_surface_height_above_reference_datum" "water_surface_height_uncertainty" "lake_surface_water_extent"
[4] "lake_surface_water_extent_uncertainty" "lake_surface_water_temperature" "lswt_uncertainty"
# Writing the cropped dataset as NetCDF using the writeCDF function
filepath<-"Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0"
fname <- paste0( "C:/Users/CEDA/",filepath,".nc")
rnc <- writeCDF(test3, filename = fname, overwrite = TRUE)
My main issue is that when I read the clipped NetCDF file back in, the variable names of the original NetCDF are not kept. They are all renamed automatically when I save the clipped dataset as a new NetCDF file using the writeCDF function.
#Reading in the new clipped file
LakeA<-rast("Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0.nc")
> head(names(LakeA))
[1] "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_1" "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_2"
[3] "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_3" "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_4"
[5] "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_5" "Lake_A_ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20020501-fv1.0_6"
So is it possible to clone/copy all the metadata variables from the original NetCDF dataset when clipping to the smaller domain/shapefile in R, and then save the result as NetCDF? Any guidance on how to do this in R would be really appreciated. (NetCDF and R are both new to me, so I am not sure what I am missing and I don't have the in-depth knowledge to sort this out.)

You have a NetCDF file with many (52) variables (sub-datasets). When you open the file with rast these become "layers". Alternatively you can open the file with sds to keep the sub-dataset structure but that does not help you here (and you would need to skip the first two, see below).
library(terra)
f <- "ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc"
r <- rast(f)
r
#class : SpatRaster
#dimensions : 21600, 43200, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#sources : ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_above_reference_datum
ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_uncertainty
ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:lake_surface_water_extent
... and 49 more source(s)
#varnames : water_surface_height_above_reference_datum (water surface height above geoid)
water_surface_height_uncertainty (water surface height uncertainty)
lake_surface_water_extent (Lake Water Extent)
...
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#unit : m, m, km2, km2, Kelvin, Kelvin, ...
#time : 2019-01-01
Note that there are 52 layers and sources (sub-datasets). There are names
head(names(r))
#[1] "water_surface_height_above_reference_datum" "water_surface_height_uncertainty"
#[3] "lake_surface_water_extent" "lake_surface_water_extent_uncertainty"
#[5] "lake_surface_water_temperature" "lswt_uncertainty"
There are also "longnames" (these are often much longer than the variable names, though not in this case)
head(longnames(r))
# [1] "water surface height above geoid" "water surface height uncertainty" "Lake Water Extent"
# [4] "Water extent uncertainty" "lake surface skin temperature" "Total uncertainty"
You can also open the file with sds, but you need to skip "lon_bounds" and "lat_bounds" variables (dimensions)
s <- sds(f, 3:52)
Now read a vector data set (shapefile in this case) and crop
lake <- vect("hydro_p_LakeErie.shp")
rc <- crop(r, lake)
rc
#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#source : memory
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#min values : NaN, NaN, NaN, NaN, 271.170, 0.283, ...
#max values : NaN, NaN, NaN, NaN, 277.090, 0.622, ...
#time : 2019-01-01
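As a quick sanity check on the crop, the dimensions of rc follow directly from its extent and the cell size. The sketch below is plain Python rather than R, and grid_dims is a hypothetical helper name of my own; it assumes a regular grid with 1/120 degree cells, which is what the printed resolution suggests.

```python
def grid_dims(xmin, xmax, ymin, ymax, res):
    """Return (nrow, ncol) for a regular grid with square cells of size `res`."""
    ncol = round((xmax - xmin) / res)
    nrow = round((ymax - ymin) / res)
    return nrow, ncol

# Assumed cell size: 1/120 degree (printed by terra as 0.008333333).
RES = 1 / 120

# Extent of the cropped raster `rc` as printed above (ymin is 41 + 23/60).
cropped = grid_dims(-83.475, -78.85, 41 + 23 / 60, 42.9, RES)

# Extent of the global raster `r`.
full = grid_dims(-180, 180, -90, 90, RES)

print(cropped)  # (182, 555)
print(full)     # (21600, 43200)
```

If the numbers you get for your own lake do not match what terra prints, the extent or resolution you typed in is the first thing to check.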
It can be convenient to save this to a GeoTIFF file like this (or, even better, to use the filename argument in crop)
gtf <- writeRaster(rc, "test.tif", overwrite=TRUE)
gtf
#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#source : test.tif
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#min values : NaN, NaN, NaN, NaN, 271.170, 0.283, ...
#max values : NaN, NaN, NaN, NaN, 277.090, 0.622, ...
What has changed is that the data are now in a file, rather than in memory. And you still have the layer (variable) names.
To write the layers as variables to a NetCDF file you need to create a SpatRasterDataset. You can do that like this:
x <- as.list(rc)
s <- sds(x)
names(s) <- names(rc)
longnames(s) <- longnames(r)
units(s) <- units(r)
Note the use of longnames(r) and units(r) (not rc). This is because r has subdatasets (and each has a longname and a unit) while rc does not.
Now use writeCDF
z <- writeCDF(s, "test.nc", overwrite=TRUE)
rc2 <- rast("test.nc")
rc2
#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#sources : test.nc:water_surface_height_above_reference_datum
test.nc:water_surface_height_uncertainty
test.nc:lake_surface_water_extent
... and 49 more source(s)
#varnames : water_surface_height_above_reference_datum (water surface height above geoid)
water_surface_height_uncertainty (water surface height uncertainty)
lake_surface_water_extent (Lake Water Extent)
...
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#unit : m, m, km2, km2, Kelvin, Kelvin, ...
#time : 2019-01-01
So it looks like we have a NetCDF with the same structure.
Note that the current CRAN version of terra drops the time variable if there is only one time step. The development version (1.3-11) keeps the time dimension, even if there is only one step.
You can install the development version with
install.packages('terra', repos='https://rspatial.r-universe.dev')

Marginal Means accounting for the random effect uncertainty

When we have repeated measurements on an experimental unit, typically these units cannot be considered 'independent' and need to be modeled in a way that we get valid estimates for our standard errors.
When I compare the intervals obtained by computing the marginal means for the treatment using a mixed model (treating the unit as a random effect) and in the other case, first averaging over the unit and THEN running a simple linear model on the averaged responses, I get the exact same uncertainty intervals.
How do we incorporate the uncertainty of the measurements of the unit into the uncertainty of what we think our treatments look like?
In order to really propagate all the uncertainty, shouldn't we see what the treatment looks like, averaged over "all possible measurements" on a unit?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(emmeans)
library(lme4)
#> Loading required package: Matrix
library(ggplot2)
tmp <- structure(list(treatment = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), response = c(151.27333548, 162.3933313,
159.2199999, 159.16666725, 210.82, 204.18666667, 196.97333333,
194.54666667, 154.18666667, 194.99333333, 193.48, 191.71333333,
124.1, 109.32666667, 105.32, 102.22, 110.83333333, 114.66666667,
110.54, 107.82, 105.62000069, 79.79999821, 77.58666557, 75.78666928
), experimental_unit = c("A-1", "A-1", "A-1", "A-1", "A-2", "A-2",
"A-2", "A-2", "A-3", "A-3", "A-3", "A-3", "B-1", "B-1", "B-1",
"B-1", "B-2", "B-2", "B-2", "B-2", "B-3", "B-3", "B-3", "B-3"
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))
### Option 1 - Treat the experimental unit as a random effect since there are
### 4 repeat observations for the same unit
lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(.,aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Option 2 - instead of treating the unit as random effect, we average over the
### 4 repeat observations, and run a simple linear model
tmp %>%
group_by(experimental_unit) %>%
summarise(mean_response = mean(response)) %>%
mutate(treatment = c(rep("A", 3), rep("B", 3))) %>%
lm(mean_response ~ treatment, data = .) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(., aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Whether we include a random effect for the unit, or average over it and THEN model it, we find no difference in the
### marginal means for the treatments
### How do we incorporate the variation of the repeat measurements into the marginal means of the treatments?
### Do we then ignore the variation in the 'subsamples' and simply average over them PRIOR to modeling?
```
<sup>Created on 2021-07-31 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
emmeans() does take into account the errors of random effects. This is what I get when I remove the complex sequences of pipes:
> mmod = lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp)
> emmeans(mmod, "treatment")
treatment emmean SE df lower.CL upper.CL
A 181 10.8 4 151.0 211
B 102 10.8 4 71.9 132
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This is as shown. If I fit a fixed-effects model that accounts for experimental units as a fixed effect, I get:
> fmod = lm(response ~ treatment + experimental_unit, data = tmp)
> emmeans(fmod, "treatment")
NOTE: A nesting structure was detected in the fitted model:
experimental_unit %in% treatment
treatment emmean SE df lower.CL upper.CL
A 181 3.25 18 174.2 188
B 102 3.25 18 95.1 109
Results are averaged over the levels of: experimental_unit
Confidence level used: 0.95
The SEs of the latter results are considerably lower, and that is because the random variations in experimental_unit are modeled as fixed variations.
Apparently the piping you did accounts for the variation of the random effects and includes those in the EMMs. I think that is because you did things separately for each experimental unit and somehow combined those results. I'm not very comfortable with a sequence of pipes that is 7 steps long, and I don't understand why that results in just one set of means.
I recommend against the as.data.frame() at the end. That zaps out annotations that can be helpful in understanding what you have. If you are doing that to get more digits of precision, I'll claim that those are digits you don't need; they just exaggerate the precision you are entitled to claim.
Notes on some follow-up comments
Subsequently, I am convinced that what we see in the piped operations in the second part of the OP does indeed comprise computing the mean of each EU, then analyzing those.
Let's look at that in the context of the formal model. We have (sorry MathJax doesn't work on stackoverflow, but I'll leave the markup there anyway)
$$ Y_{ijk} = \mu + \tau_i + U_{ij} + E_{ijk} $$
where $Y_{ijk}$ is the kth response measurement on the ith treatment and jth EU in the ith treatment, and the rhs terms represent respectively the overall mean, the (fixed) treatment effects, the (random) EU effects, and the (random) error effects. We assume the random effects are all mutually independent. With a balanced design, the EMMs are just the marginal means:
$$ \bar Y_{i..} = \mu + \tau_i + \bar U_{i.} + \bar E_{i..} $$
where a '.' subscript means we averaged over that subscript. If there are n EUs per treatment and m measurements on each EU, we get that
$$ Var(\bar Y_{i..}) = \sigma^2_U / n + \sigma^2_E / (mn) $$
Now, if we aggregate the data on EUs ahead of time, we are starting with
$$ \bar Y_{ij.} = \mu + \tau_i + U_{ij} + \bar E_{ij.} $$
However, if we then compute marginal means by averaging over j, we get exactly the same thing as we did before with $\bar Y_{i..}$, and the variance is exactly as already shown. That is why it doesn't matter if we aggregated first or not.
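The equivalence can be checked numerically on the data from the question. The sketch below is plain Python (standard library only) rather than R, and the variable names are my own. It averages the four measurements within each EU, pools the variance of the EU means over the two treatments, and divides by the n = 3 EUs per treatment. It relies on the design being balanced, as the derivation above does.

```python
from math import sqrt
from statistics import mean

# Responses from the question, grouped by treatment and experimental unit (EU).
data = {
    "A": {
        "A-1": [151.27333548, 162.3933313, 159.2199999, 159.16666725],
        "A-2": [210.82, 204.18666667, 196.97333333, 194.54666667],
        "A-3": [154.18666667, 194.99333333, 193.48, 191.71333333],
    },
    "B": {
        "B-1": [124.1, 109.32666667, 105.32, 102.22],
        "B-2": [110.83333333, 114.66666667, 110.54, 107.82],
        "B-3": [105.62000069, 79.79999821, 77.58666557, 75.78666928],
    },
}

# Step 1: aggregate each EU to its mean.
eu_means = {trt: [mean(ys) for ys in units.values()] for trt, units in data.items()}

# Step 2: treatment means are the averages of the EU means.
trt_means = {trt: mean(ms) for trt, ms in eu_means.items()}

# Step 3: pooled variance of the EU means around their treatment mean.
# It estimates sigma_U^2 + sigma_E^2 / m.
ss = sum((m - trt_means[trt]) ** 2 for trt, ms in eu_means.items() for m in ms)
df = sum(len(ms) - 1 for ms in eu_means.values())  # 2 + 2 = 4
n = 3  # EUs per treatment

# Step 4: SE of a treatment mean = sqrt((sigma_U^2 + sigma_E^2 / m) / n).
se = sqrt(ss / df / n)

print({t: round(v, 4) for t, v in trt_means.items()}, round(se, 4), df)
```

The means, the standard error and the 4 degrees of freedom agree with the emmeans output for both the mixed model and the aggregated lm. The within-EU variation enters only through the EU means, which is what the variance formula says.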

Option to cut values below a threshold in papaja::apa_table

I can't figure out how to selectively print values in a table above or below some value. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut" (abs() keeps large negative loadings)
derp_loadings[abs(derp_loadings) < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
derp_loadings
, caption = "Factor loadings"
, added_stub_head = "Item"
, format = "pandoc" # Omit this in your R Markdown document
, format.args = list(na_string = "") # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63

GPflow, bvh: ValueError: mean must be 1 dimensional

I am getting a weird "ValueError: mean must be 1 dimensional" when I am trying to build a Hierarchical GP-LVM model. Basically I'm trying to reproduce this paper: Hierarchical Gaussian Process Latent Variable Models using GPflow.
Therefore I implemented my own new model as follows:
class myGPLVM(gpflow.models.BayesianModel):
    def __init__(self, data, latent_data, x_data_mean, kernel):
        super().__init__()
        print("GPLVM")
        self.kernel0 = kernel[0]
        self.kernel1 = kernel[1]
        self.mean_function = Zero()
        self.likelihood0 = gpflow.likelihoods.Gaussian(1.0)
        self.likelihood1 = gpflow.likelihoods.Gaussian(1.0)
        # make some parameters
        self.data = (gpflow.Parameter(x_data_mean), gpflow.Parameter(latent_data), data)

    def hierarchy_ll(self):
        x, h, y = self.data
        K = self.kernel0(x)
        num_data = x.shape[0]
        k_diag = tf.linalg.diag_part(K)
        s_diag = tf.fill([num_data], self.likelihood0.variance)
        ks = tf.linalg.set_diag(K, k_diag + s_diag)
        L = tf.linalg.cholesky(ks)
        m = self.mean_function(x)
        return multivariate_normal(h, m, L)

    def log_likelihood(self):
        """
        Computes the log likelihood.
        .. math::
            \log p(Y | \theta).
        """
        x, h, y = self.data
        K = self.kernel1(h)
        num_data = h.shape[0]
        k_diag = tf.linalg.diag_part(K)
        s_diag = tf.fill([num_data], self.likelihood1.variance)
        ks = tf.linalg.set_diag(K, k_diag + s_diag)
        L = tf.linalg.cholesky(ks)
        m = self.mean_function(h)
        # [R,] log-likelihoods for each independent dimension of Y
        log_prob = multivariate_normal(y, m, L)  # <- throws the error!
        log_prob_h = self.hierarchy_ll()
        log_likelihood = tf.reduce_sum(log_prob) + tf.reduce_sum(log_prob_h)
        return log_likelihood
The model seems to work with a toy example:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=40, centers=3, n_features=12, random_state=2)
Y = tf.convert_to_tensor(X, dtype=default_float())
but fails and throws the error when I try it with a bvh file (the one from the paper, actually). I also used Lawrence's code to read my bvh from mocap, which I modified to work with Python 3.
Anyway, it's been a few days and I am out of ideas. I tried multiple ways to force my mean array "m" to be one-dimensional, but nothing worked. I also tried the "three_phase_oil_flow" dataset from the first GPLVM paper, which works as well.
Therefore, I would assume that my model is correct, or at least that some optimisation is going on, and I suspect that the bvh reader could be the cause. But the data all seems fine to me... In particular, I don't understand why, when forcing the mean explicitly like this:
m = np.zeros((np.shape(m)[0], 1))
log_prob = multivariate_normal(y, m, L)
or even with the gpflow Zero function
m = Zero(h)
log_prob = multivariate_normal(y, m, L)
it still throws the error. Any help will be highly appreciated.
edited thanks to: Artem Artemev
The rest of the code if anyone wants to try to reproduce:
https://github.com/michaelStettler/h-GPLVM
error flow:
(venv) MacBookMichael2:stackOverflow michaelstettler$ python3 HGPLVM.py
(199, 96)
shape Y (199, 3, 38)
2020-01-26 17:00:48.104029: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-26 17:00:48.113609: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f8dd5ff5410 executing computations on platform Host. Devices:
2020-01-26 17:00:48.113627: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
shape Y (199, 38)
Number of points: 199 and Number of dimensions: 38
shape x_mean_latent (199, 8)
shape x_mean_init (199, 2)
HGPLVM
gpr_data (199, 2) (199, 8) (199, 38)
2020-01-26 17:00:48.139003: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
shape m (199, 1)
Traceback (most recent call last):
File "HGPLVM.py", line 131, in <module>
_ = opt.minimize(closure, method="bfgs", variables=model.trainable_variables, options=dict(maxiter=maxiter))
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/gpflow/optimizers/scipy.py", line 60, in minimize
**scipy_kwargs)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/scipy/optimize/_minimize.py", line 594, in minimize
return _minimize_bfgs(fun, x0, args, jac, callback, **options)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/scipy/optimize/optimize.py", line 998, in _minimize_bfgs
gfk = myfprime(x0)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/scipy/optimize/optimize.py", line 327, in function_wrapper
return function(*(wrapper_args + args))
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/scipy/optimize/optimize.py", line 73, in derivative
self(x, *args)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/scipy/optimize/optimize.py", line 65, in __call__
fg = self.fun(x, *args)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/gpflow/optimizers/scipy.py", line 72, in _eval
loss, grads = _compute_loss_and_gradients(closure, variables)
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/gpflow/optimizers/scipy.py", line 116, in _compute_loss_and_gradients
loss = loss_cb()
File "HGPLVM.py", line 127, in closure
return - model.log_marginal_likelihood()
File "/Users/michaelstettler/PycharmProjects/GPflow/venv/lib/python3.6/site-packages/gpflow/models/model.py", line 45, in log_marginal_likelihood
return self.log_likelihood(*args, **kwargs) + self.log_prior()
File "HGPLVM.py", line 62, in log_likelihood
log_prob = multivariate_normal(y, m, L)
File "mtrand.pyx", line 3729, in numpy.random.mtrand.RandomState.multivariate_normal
ValueError: mean must be 1 dimensional
I would recommend posting a working MWE code. I have tried to use your code snippets, but it gives me errors.
I don't have issues with multivariate_normal function. If you have localised the issue correctly you can debug TF2.0 more thoroughly and find the place that causes that exception. Here is the code which I'm running:
In [2]: from sklearn.datasets.samples_generator import make_blobs
...: X, y = make_blobs(n_samples=40, centers=3, n_features=12, random_state=2)
In [10]: m = np.zeros((np.shape(y)[0], 1))
In [11]: m.shape
Out[11]: (40, 1)
In [12]: y.shape
Out[12]: (40,)
In [13]: L = np.eye(m.shape[0])
In [15]: gpflow.logdensities.multivariate_normal(y, m, L)
Out[15]:
<tf.Tensor: shape=(40,), dtype=float64, numpy=
array([ -56.75754133, ...])>

Using interp2 - The grid must be created from grid vectors which are strictly monotonically increasing

I am using MATLAB_R2016b.
I have data of this format
Temperature = [310:10:800];
Pressure = [0.1 0 1.0 10.0 100.0 1000.0];
Cv = ...
[ 73.6400 25.3290 73.5920 73.1260 69.4500 61.8600
72.8060 25.3810 72.7640 72.3450 68.9780 61.7040
71.9230 25.4380 71.8850 71.5070 68.4230 61.3140
71.0060 25.4990 70.9710 70.6290 67.8040 60.8160
70.0680 25.5640 70.0360 69.7270 67.1400 60.2840
69.1220 25.6340 69.0940 68.8140 66.4460 59.7550
68.1800 25.7070 68.1540 67.9000 65.7350 59.2500
27.6640 25.7840 67.2240 66.9940 65.0150 58.7780
27.3630 25.8640 66.3120 66.1040 64.2950 58.3390
27.1700 25.9480 65.4220 65.2330 63.5820 57.9340
27.0440 26.0340 64.5570 64.3850 62.8790 57.5600
26.9660 26.1230 63.7210 63.5640 62.1900 57.2130
26.9240 26.2150 62.9130 62.7700 61.5170 56.8890
26.9110 26.3090 62.1360 62.0050 60.8620 56.5870
26.9200 26.4050 61.3890 61.2690 60.2250 56.3020
26.9460 26.5030 33.1250 60.5620 59.6080 56.0320
26.9870 26.6030 31.8460 59.8850 59.0090 55.7750
27.0390 26.7050 31.0570 59.2360 58.4290 55.5290
27.1010 26.8080 30.5000 58.6170 57.8680 55.2920
27.1700 26.9120 30.0840 58.0280 57.3240 55.0630
27.2460 27.0170 29.7670 57.4700 56.7980 54.8410
27.3280 27.1240 29.5260 56.9450 56.2900 54.6250
27.4140 27.2320 29.3430 56.4560 55.7970 54.4150
27.5040 27.3410 29.2080 56.0070 55.3210 54.2090
27.5980 27.4500 29.1110 55.6040 54.8600 54.0080
27.6940 27.5610 29.0460 55.2610 54.4150 53.8100
27.7930 27.6720 29.0060 54.9970 53.9840 53.6160
27.8950 27.7840 28.9870 54.8470 53.5670 53.4260
27.9980 27.8970 28.9870 51.7540 53.1650 53.2390
28.1030 28.0110 29.0020 47.2710 52.7760 53.0550
28.2100 28.1250 29.0290 44.3160 52.4010 52.8750
28.3180 28.2400 29.0670 42.1390 52.0390 52.6980
28.4270 28.3550 29.1150 40.4520 51.6910 52.5230
28.5380 28.4710 29.1710 39.1070 51.3570 52.3520
28.6500 28.5880 29.2340 38.0170 51.0350 52.1840
28.7630 28.7060 29.3040 37.1240 50.7260 52.0200
28.8770 28.8240 29.3780 36.3870 50.4300 51.8580
28.9920 28.9420 29.4580 35.7750 50.1460 51.7000
29.1080 29.0610 29.5420 35.2640 49.8730 51.5440
29.2250 29.1810 29.6290 34.8380 49.6100 51.3930
29.3420 29.3010 29.7200 34.4810 49.3570 51.2440
29.4610 29.4220 29.8150 34.1820 49.1120 51.0990
29.5800 29.5440 29.9120 33.9330 48.8720 50.9570
29.6990 29.6660 30.0110 33.7250 48.6360 50.8190
29.8200 29.7880 30.1130 33.5540 48.4000 50.6830
29.9410 29.9110 30.2170 33.4130 48.1630 50.5520
30.0630 30.0340 30.3230 33.3000 47.9210 50.4230
30.1850 30.1580 30.4310 33.2100 47.6720 50.2990
30.3080 30.2820 30.5400 33.1400 47.4140 50.1770
30.4310 30.4070 30.6510 33.0890 47.1430 50.0590];
When I try to query a new [temperature, pressure] pair, for example [0.2, 341] by doing this
interp2(Temperature, Pressure, Cv, 0.2, 341)
I get the following error:
Error using griddedInterpolant
The grid vectors must be strictly monotonically increasing.
Error in interp2>makegriddedinterp (line 229)
F = griddedInterpolant(varargin{:});
Error in interp2 (line 129)
F = makegriddedinterp({X, Y}, V, method,extrap);
What am I doing wrong? And how can I get the desired result?
You need to have the same number of points in Temperature and Pressure as you do in Cv. You can generate these points using meshgrid.
[Temp, Pres] = meshgrid(Temperature, Pressure)
% Temp and Pres are both 6x50 matrices
However, you still have an issue. Temperature and Pressure must be strictly monotonically increasing, as the error message states. This means the pressure values can never go down, which they do here (0.1 is followed by 0). You must either change the second value in the Pressure array or, for instance, swap columns 1 and 2 of the Pressure and Cv arrays:
Pressure = [0.1, 0, 1, 10, 100, 1000]; % original
Pressure = Pressure([2, 1, 3:end]); % swap columns 1 and 2, Pressure = [0,0.1,1,10,...]
Cv = [...]; % All of your data
Cv = Cv(:, [2, 1, 3:end]) % Swap columns 1 and 2
Now you can do your lookup. Note you also had the temperature and pressure values the wrong way around: they must be in the same order for inputs 1/2 and inputs 4/5.
[Temp, Pres] = meshgrid(Temperature, Pressure);
out = interp2(Temp, Pres, Cv.', 341, 0.2) % not (..., 0.2, 341) as must be same order
>> out = 30.5468
The only consideration you might want to give is that you're using linear interpolation with interp2, but your pressure data is logarithmic. Check your results are sensible.
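To see what the lookup does, here is a hand-rolled bilinear interpolation in plain Python (not MATLAB; the helper names are my own) on the four grid points from the question that surround T = 341, P = 0.2. It first checks that the grid vector is strictly increasing, which is the condition the error message complains about. By this arithmetic, with the Cv columns reordered to match Pressure = [0, 0.1, 1, ...] the result is about 70.91, while 30.5468 is what comes out if the Cv columns stay in their original order while Pressure is reordered. It may be worth confirming which column order was in effect when you run the lookup.

```python
def strictly_increasing(v):
    """True if every element is larger than the one before it."""
    return all(a < b for a, b in zip(v, v[1:]))

def bilinear(x, y, x0, x1, y0, y1, f00, f10, f01, f11):
    """Bilinear interpolation; fXY is the value at grid point (x_X, y_Y)."""
    tx = (x - x0) / (x1 - x0)
    ty = (y - y0) / (y1 - y0)
    low = f00 + tx * (f10 - f00)   # interpolate along x at y0
    high = f01 + tx * (f11 - f01)  # interpolate along x at y1
    return low + ty * (high - low)

pressure_original = [0.1, 0, 1.0, 10.0, 100.0, 1000.0]
pressure_sorted = sorted(pressure_original)
print(strictly_increasing(pressure_original))  # False
print(strictly_increasing(pressure_sorted))    # True

# Rows of Cv at T = 340 and T = 350, first three columns as typed in the
# question (original column order, i.e. P = 0.1, 0, 1.0).
cv_340 = [71.0060, 25.4990, 70.9710]
cv_350 = [70.0680, 25.5640, 70.0360]

# Columns reordered to match the sorted pressures:
# P = 0.1 is original column 1, P = 1.0 is original column 3.
swapped = bilinear(341, 0.2, 340, 350, 0.1, 1.0,
                   cv_340[0], cv_350[0], cv_340[2], cv_350[2])

# Columns left in the original order while Pressure is sorted:
# the P = 0.1 slot then holds original column 2.
unswapped = bilinear(341, 0.2, 340, 350, 0.1, 1.0,
                     cv_340[1], cv_350[1], cv_340[2], cv_350[2])

print(round(swapped, 4), round(unswapped, 4))
```

The large gap between the two numbers also shows how sensitive the result is to the column order, so a spot check against a known table value is cheap insurance.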

Compare contrasts in linear model in Python (like R's contrast library?)

In R I can do the following to compare two contrasts from a linear model:
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
install.packages("downloader", repos="http://cran.us.r-project.org")
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
head(spider, 5)
# leg type friction
# 1 L1 pull 0.90
# 2 L1 pull 0.91
# 3 L1 pull 0.86
# 4 L1 pull 0.85
# 5 L1 pull 0.80
fit = lm(friction ~ type + leg, data=spider)
fit
# Call:
# lm(formula = friction ~ type + leg, data = spider)
#
# Coefficients:
# (Intercept) typepush legL2 legL3 legL4
# 1.0539 -0.7790 0.1719 0.1605 0.2813
install.packages("contrast", repos="http://cran.us.r-project.org")
library(contrast)
l4vsl2 = contrast(fit, list(leg="L4", type="pull"), list(leg="L2",type="pull"))
l4vsl2
# lm model parameter contrast
#
# Contrast S.E. Lower Upper t df Pr(>|t|)
# 0.1094167 0.04462392 0.02157158 0.1972618 2.45 277 0.0148
I have found out how to do much of the above in Python:
import pandas as pd
df = pd.read_table("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv", sep=",", skiprows=1)
df.head(2)
import statsmodels.formula.api as sm
model1 = sm.ols(formula='friction ~ type + leg', data=df)
fitted1 = model1.fit()
print(fitted1.summary())
Now all that remains is finding the t-statistic for the contrast of leg pair L4 vs. leg pair L2. Is this possible in Python?
statsmodels is still missing some predefined contrasts, but the t_test and wald_test or f_test methods of the model Results classes can be used to test linear (or affine) restrictions. The restrictions can be given either by arrays or by strings using the parameter names.
Details on how to specify contrasts/restrictions should be in the documentation.
For example:
>>> tt = fitted1.t_test("leg[T.L4] - leg[T.L2]")
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
The results are attributes or methods in the instance that is returned by t_test. For example the conf_int can be obtained by
>>> tt.conf_int()
array([[ 0.02157158, 0.19726175]])
t_test is vectorized and treats each restriction or contrast as a separate hypothesis. wald_test treats a list of restrictions as a joint hypothesis:
>>> tt = fitted1.t_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 -0.0114 0.043 -0.265 0.792 -0.096 0.074
c1 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
>>> tt = fitted1.wald_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
<F test: F=array([[ 8.10128575]]), p=0.00038081249480917173, df_denom=277, df_num=2>
Aside: this also works for robust covariance matrices if cov_type was specified as argument to fit.
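As a rough illustration of what such a string restriction encodes, the sketch below (plain Python, standard library only, with my own names) writes each contrast as a set of weights on the coefficients printed by lm in the question and recovers the estimate and t-statistic. The coefficient keys follow the leg[T.L4] naming used above; "Intercept" and "type[T.push]" are my assumption for the remaining names. The standard error is copied from the reported contrast output rather than computed, since computing it would need the coefficient covariance matrix.

```python
# Coefficients as printed by lm() in the question (rounded to 4 decimals).
coef = {
    "Intercept": 1.0539,
    "type[T.push]": -0.7790,
    "leg[T.L2]": 0.1719,
    "leg[T.L3]": 0.1605,
    "leg[T.L4]": 0.2813,
}

def contrast_estimate(weights, coef):
    """Linear combination sum(w_j * beta_j); terms not listed have weight 0."""
    return sum(w * coef[name] for name, w in weights.items())

# "leg[T.L4] - leg[T.L2]" puts weights +1 and -1 on those two terms. The
# intercept and type effects cancel because both sides use type = pull.
l4_vs_l2 = contrast_estimate({"leg[T.L4]": 1, "leg[T.L2]": -1}, coef)
l3_vs_l2 = contrast_estimate({"leg[T.L3]": 1, "leg[T.L2]": -1}, coef)

# Standard error taken from the contrast output in the question, not recomputed.
se_l4_vs_l2 = 0.04462392
t_stat = l4_vs_l2 / se_l4_vs_l2

print(round(l4_vs_l2, 4), round(l3_vs_l2, 4), round(t_stat, 2))
```

The two estimates line up with the c1 and c0 rows of the vectorized t_test table, and the ratio matches the reported t up to the rounding of the printed coefficients.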