How do I get gtsummary to exponentiate a Weibull model?

I'm running a Weibull regression using the survreg function in the survival package.
I'd like to use gtsummary to then produce a table of the results.
But when I try to exponentiate the beta coefficient, it just returns the same beta value, although the header says exp(B).
reprex:
library(survival)
library(gtsummary)
fit_w <- survreg(Surv(time, status) ~ sex, dist = 'weibull', data = lung)
tbl_regression(fit_w)
tbl_regression(fit_w, exponentiate = TRUE)

The default tidying function is broom::tidy.survreg(), which does NOT have an exponentiate argument for survreg() models. The reason it is not there is that most of the models created with this function are NOT proportional hazards models, meaning you should not exponentiate because the result is not a hazard ratio.
If you are confident your model should have exponentiated coefficients, you can write a small wrapper around broom::tidy.survreg() that adds an exponentiate argument, as sketched below.
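For example, a minimal sketch of such a wrapper (the name tidy_survreg_exp is made up for illustration; gtsummary accepts a custom tidier through the tidy_fun argument of tbl_regression(), which passes exponentiate, conf.int, and conf.level along to it):

tidy_survreg_exp <- function(x, exponentiate = FALSE, conf.int = TRUE,
                             conf.level = 0.95, ...) {
  # start from the default tidier, then exponentiate manually if requested
  df <- broom::tidy(x, conf.int = conf.int, conf.level = conf.level, ...)
  if (exponentiate) {
    cols <- intersect(c("estimate", "conf.low", "conf.high"), names(df))
    df[cols] <- lapply(df[cols], exp)
  }
  df
}

tbl_regression(fit_w, exponentiate = TRUE, tidy_fun = tidy_survreg_exp)

Again, only do this if an exponentiated coefficient actually has a meaningful interpretation for your parameterisation of the Weibull model.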


Use of priors on hyper-parameters with SVGP model

I wanted to use priors on hyper-parameters as in https://gpflow.readthedocs.io/en/develop/notebooks/advanced/mcmc.html, but with an SVGP model.
Following the steps of example 1, I got an error when I ran run_chain_fn:
TypeError: maximum_log_likelihood_objective() missing 1 required positional argument: 'data'
Unlike GPR or SGPMC, the data are not an attribute of the model; they are passed in as an external parameter.
To avoid that problem I slightly modified the SVGP class to include the data as a parameter (I don't care about mini-batching for now):
class SVGP_with_data(gpflow.models.SVGP):
    """This model is a tiny variation of the classical SVGP. It just includes
    the data as an optional parameter of the model, since they are necessary
    for MCMC sampling."""
    def __init__(self, data, **kwargs):
        super().__init__(**kwargs)
        self.data = data

    def maximum_log_likelihood_objective(self, _=None):
        return self.elbo(self.data)  # here we don't care about mini-batching
It seems to work well.
I couldn't find a code example of SVGP with priors on hyper-parameters. Is there a more standard way to deal with this?
Thanks!
SVGP is a GPflow model for a variational approximation. Using MCMC on the q(u) distribution parameterised by q_mu and q_sqrt doesn't make sense (if you want to do MCMC on q(u) in a sparse approximation, use SGPMC).
You can still put (hyper)priors on the hyperparameters in the SVGP model; gradient-based optimisation will then lead to the maximum a-posteriori (MAP) point estimate (as opposed to pure maximum likelihood).
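As a rough sketch of that second point (the toy data and the Gamma prior choices here are purely illustrative, and parameter names such as lengthscales can differ slightly between GPflow versions):

import numpy as np
import gpflow
import tensorflow_probability as tfp

# toy data and a standard SVGP model
X = np.random.rand(100, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(100, 1)
model = gpflow.models.SVGP(
    kernel=gpflow.kernels.Matern52(),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=X[::10].copy(),
)

# put priors on the hyperparameters (not on q_mu / q_sqrt)
f64 = gpflow.utilities.to_default_float
model.kernel.lengthscales.prior = tfp.distributions.Gamma(f64(2.0), f64(2.0))
model.kernel.variance.prior = tfp.distributions.Gamma(f64(2.0), f64(2.0))
model.likelihood.variance.prior = tfp.distributions.Gamma(f64(2.0), f64(2.0))

# ordinary gradient-based training now yields the MAP estimate
opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss_closure((X, Y)), model.trainable_variables)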

How could I use 'fmi2GetDirectionalDerivative'?

I am trying to get the model Jacobian matrix from an FMU. According to the literature below, I could use fmi2GetDirectionalDerivative to do this, but I am not sure what exactly I need to do.
My question is:
Could I call this function in Dymola or MATLAB?
A screenshot or an example would be very helpful.
https://ep.liu.se/ecp/132/091/ecp17132831.pdf
I am not familiar with calling DLL functions in MATLAB, but here is an example in Python. FMPy (https://github.com/CATIA-Systems/FMPy) provides wrappers for running FMUs in Python.
I've tested this for a simple model I've written here (How to access model jacobian from FMU or Dymola without analytical jacobian). In this case, the knowns are the value references of either states or inputs, and the unknowns are the value references of derivatives or outputs.
I have had success extracting the Jacobian when the model is exported through Dymola as a Model Exchange FMU, but not as a Co-Simulation FMU.
def get_jacobian(fmu, vr_knowns, vr_unknowns):
    """
    populates jacobian from list of knowns and unknowns
    can only be called after the current sim time and inputs are set
    """
    jacobian = []
    try:
        for vr_known in vr_knowns:
            for vr_unknown in vr_unknowns:
                jacobian.extend(
                    fmu.getDirectionalDerivative(
                        vUnknown_ref=[vr_unknown],
                        vKnown_ref=[vr_known],
                        dvKnown=[1.0]
                    ))
        print_status(f'Jacobian Elements: {jacobian}')
    except Exception as e:
        print("[ERROR] cannot compute jacobian at current timestep")
        print(f"[ERROR] {e}")
I use this code snippet to collect the value references for states and derivatives using FMPy:
# get the FMU model description object
model_description = fmpy.read_model_description(
    os.path.join(fmu_path, fmu_filename)
)

# collect the value references
vrs = {}
for variable in model_description.modelVariables:
    vrs[variable.name] = variable.valueReference

# collect the lists of states and derivatives
states = []
derivatives = []
for derivative in model_description.derivatives:
    derivatives.append(derivative.variable.name)
    states.append(re.findall(r'^der\((.*)\)$', derivative.variable.name)[0])

# collect the value references for states and derivatives
vr_states = [vrs[x] for x in states]
vr_derivatives = [vrs[x] for x in derivatives]
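Putting the two pieces together, the directional derivatives can also be assembled directly into a dense state Jacobian, for example (a sketch; the row/column ordering simply follows the loop order used in get_jacobian above):

import numpy as np

J = np.array([
    fmu.getDirectionalDerivative(vUnknown_ref=[vr_der],
                                 vKnown_ref=[vr_state],
                                 dvKnown=[1.0])[0]
    for vr_state in vr_states
    for vr_der in vr_derivatives
]).reshape(len(vr_states), len(vr_derivatives))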
There was a paper from Mitsubishi Electric at the recent NAM Modelica Conference 2020 that might be related.

Performing additional validation in LIBSVM matlab

I have been working with LIBSVM in MATLAB for a while to do prediction. I have a dataset, of which I use 75% for training, 15% for finding the best parameters, and the remainder for testing. The code is given below.
trainX and trainY are the input and output training instances
testValX and testValY are the validation dataset I use
for j = 1:100
    for jj = 1:10
        model(j,jj) = svmtrain(trainY, trainX, ...
            ['-s 3 -t 2 -c ' num2str(j) ' -p 0.001 -g ' num2str(jj) '-v 5']);
        [predicted_label, ~, ~] = svmpredict(testValY, ...
            testValX, model(j,jj));
        MSE(j,jj) = sum(((predicted_label - testValY).^2)/2);
    end
end
[min_val, min_indi] = min(MSE(:));
best_predicted_model_rbf(i) = model(min_indi);
My question here is whether this is correct. I am creating a model matrix with different values of c and g. I use the -v option, which is key here. From the resulting models I use the validation dataset for prediction and thereby compute the mean squared error. Using this MSE I pick the best c and g. Since I am using -v, which returns the cross-validated output, is the procedure I follow correct?
First, I think there is a slight problem with the code shown: in num2str(jj) '-v 5'] there is no space before the -v, which may cause that flag not to be read. In the other question, you stated that this 'sometimes returns a model', which is what would happen if the flag was not read. When the '-v' flag is read, you should only get a number (the cross-validation result), not a model.
Second, it looks like you are doing two different things here, either one of which would be reasonable on its own. Calling svmtrain with '-v' runs cross validation on the training set. That shouldn't return a model, it should just return an mse estimate. You could use these estimates to determine which parameter setting was best, and then train one model with that setting on all of the training data.
Anyway, next you call svmpredict(y,x,model) on a hold-out validation set, testValX, but having called svmtrain with '-v', model should just be a scalar at this point. In order for this call to run correctly, you have to get the model from svmtrain without '-v', so that it is a struct. The rest of what you are doing makes sense for this case, in which you are doing hold-out validation using testValX.
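To make that concrete, here is a sketch of the two-stage procedure described above (assuming LIBSVM's svmtrain/svmpredict are on the path and the variables from the question exist; with '-s 3' and '-v 5', svmtrain returns a scalar cross-validation MSE rather than a model struct):

cvMSE = zeros(100, 10);
for j = 1:100
    for jj = 1:10
        % note the space before -v
        cvMSE(j,jj) = svmtrain(trainY, trainX, ...
            ['-s 3 -t 2 -p 0.001 -c ' num2str(j) ' -g ' num2str(jj) ' -v 5']);
    end
end
[~, idx] = min(cvMSE(:));
[bestC, bestG] = ind2sub(size(cvMSE), idx);

% retrain a single model with the chosen parameters (no '-v'),
% then evaluate it on the hold-out validation set
bestModel = svmtrain(trainY, trainX, ...
    ['-s 3 -t 2 -p 0.001 -c ' num2str(bestC) ' -g ' num2str(bestG)]);
predicted = svmpredict(testValY, testValX, bestModel);
valMSE = mean((predicted - testValY).^2);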

function parameters in matlab wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, which should be seen as the 'gold standard'; the program I'm trying to switch to is MATLAB.
I was trying to be smart and wrote some code (a lot at least for me) and the goal of that code was the following:
1. Taking an individual longitudinal data file
2. Curve fitting this data with a non-parametric model using lsqcurvefit
3. Obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)) but when I started comparing the function parameters the two programs generate, there is a huge difference. KaleidaGraph stays close to its original starting values; MATLAB wanders off and sometimes gets larger by a factor of 1000. The graphs, however, stay more or less the same in both situations and both fit the data well. Still, it would be lovely if I knew how to make the MATLAB curve fitting more 'conservative', staying closer to its original starting values.
validFitPersons = true(nbValidPersons, 1);
for i = 1:nbValidPersons
    personalData = data{validPersons(i), 3};
    personalData = personalData(personalData(:,1) >= minAge, :);
    % Fit a specific model for all valid persons
    try
        opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
        [personalParams, personalRes, personalResidual] = lsqcurvefit(heightModel, ...
            initialValues, personalData(:,1), personalData(:,2), [], [], opts);
    catch
        x = 1;
    end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model, 'jpa2')
    % y = a.*(1 - 1/(1 + (b_1(t+e))^c_1 + (b_2(t+e))^c_2 + (b_3(t+e))^c_3))
    heightModel = @(params, ages) abs(params(1).*(1 - 1./(1 + (params(2).*(ages + params(8))).^params(5) ...
        + (params(3).*(ages + params(8))).^params(6) + (params(4).*(ages + params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
    % Define initial values
    if strcmpi('male', gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
I've tried to mimic the curve-fitting process in KaleidaGraph as closely as possible. There I found that they use the Levenberg-Marquardt algorithm, which I've selected. However, the results still vary and I don't have any more clues about how I can change this.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is: I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file. Since there are different models, it would be interesting if I could put restrictions on how far my parameters can wander off from their starting values.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would try to check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use upper bounds and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
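For example (a sketch reusing the names from the question; note that bound constraints are not supported by the 'levenberg-marquardt' algorithm, so lsqcurvefit uses its default 'trust-region-reflective' algorithm when lb/ub are supplied):

% keep each parameter within +/-50% of its starting value (illustrative choice)
lb = initialValues - 0.5*abs(initialValues);
ub = initialValues + 0.5*abs(initialValues);
personalParams = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2), lb, ub);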
Cheers
You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem is really that the model you have shown results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
Edit: To drive the point home regarding the problem with the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
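If you want to inspect this for your own fit, here is a rough sketch of how such a parameter correlation matrix can be computed from the Jacobian that lsqcurvefit returns (the usual linearised approximation; with strongly correlated parameters J'*J is close to singular, which is exactly the symptom being discussed):

[p, resnorm, residual, ~, ~, ~, J] = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2));
J = full(J);                                   % lsqcurvefit returns a sparse Jacobian
mse   = resnorm / (numel(residual) - numel(p));
covP  = mse * inv(J'*J);                       % approximate parameter covariance
corrP = covP ./ (sqrt(diag(covP)) * sqrt(diag(covP))');   % correlation matrix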

How to get level of fitness of data to a distribution by using probplot() in Matlab?

I have 2 sets of data of float numbers, set A and set B. Both of them are 40-by-40 matrices. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in MATLAB to plot the probabilities of one set, but I do not know how to quantify how well the data fit the distribution.
In Python, when people use probplot, a parameter, R^2, shows how well the data fit the normal distribution: the closer the R^2 value is to 1, the better the fit. Thus, I can simply compare two sets of data by their R^2 values. However, because of a machine problem, I cannot use Python on my current machine. Is there a similar parameter or function to the R^2 value in MATLAB?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit (i.e., sse, rsquare, dfe, adjrsquare, rmse) can be done using the function fit; see its documentation for details.
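For instance, a small illustration of that idea (requires the Curve Fitting Toolbox; the data here are made up):

x = (1:100)';
y = 2*x + 5 + randn(100, 1);
[f, gof] = fit(x, y, 'poly1');   % gof contains sse, rsquare, dfe, adjrsquare, rmse
disp(gof.rsquare)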
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (which does, alas, require the Statistics Toolbox, but you have this if you have the student version):
Given that you are assessing univariate Normality (not multivariate Normality), consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
%# Build sample data
A = randn(1600, 1);
B = rand(1600, 1);

%# Perform JB test
[ANormal, ~, AStat] = jbtest(A);
[BNormal, ~, BStat] = jbtest(B);

%# Display result
if AStat < BStat
    disp('A is closer to normal');
else
    disp('B is closer to normal');
end
As a little bonus of doing things this way, ANormal and BNormal tell you whether you can reject the null hypothesis that the sample in A or B comes from a normal distribution. Specifically, jbtest returns 0 when it fails to reject the null (i.e. the data are consistent with a Normal distribution) and 1 when it rejects it. So if ANormal is 0, A is probably drawn from a Normal; if ANormal is 1, the data in A are probably not generated by a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)