Beta constraints in H2OGeneralizedLinearEstimator - pyspark

Is there a way to set the betas prior to the model run in H2O's GeneralizedLinearEstimator, i.e. betas that can be used as a starting point for the model? Per the documentation below, this is called beta constraints. Could someone help me with this?
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/beta_constraints.html
Here is an example of what I'm trying:
1) Model 1: I ran a model with 20 iterations and saved the betas from this run as a data frame.
2) Model 2: I ran a model with everything the same as Model 1 (including the 20 iterations), but additionally specified beta constraints using the coefficients from Model 1, trying to warm-start this model so that it begins where Model 1 ended.
3) Model 3: I ran a model with everything the same as Model 1 but with 40 iterations.
As explained in the documentation, I expected the betas from Model 2 to match the betas from Model 3.
Code with the beta_constraints specification:
model2 = H2OGeneralizedLinearEstimator(family="poisson",
                                       alpha=0,
                                       solver="L-BFGS",
                                       max_iterations=20,
                                       gradient_epsilon=1e-8,
                                       objective_epsilon=1e-8,
                                       beta_epsilon=1e-8,
                                       beta_constraints=bc)

According to the H2O Glossary, the format for beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.
First, you need the vector of betas from Model 1 (the 20-iteration run); since these will seed Model 2, call them m2_betas.
Next, because you want strict equality constraints, set those betas as the lower bounds, upper bounds, and starting values.
I have done this in R (see the related answer), where beta_constraints are passed as a data.frame, but I assume the Python API is similar and uses a pandas.DataFrame. Try:
import pandas as pd

# x.columns is assumed to hold your predictor names
constraints = pd.DataFrame({'names': x.columns,
                            'lower_bounds': m2_betas,
                            'upper_bounds': m2_betas,
                            'beta_given': m2_betas})
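Putting it together, here is a minimal sketch of the whole workflow under the answer's equality-constraint setup, assuming model1 is the trained 20-iteration model and that predictors, response, and train (your predictor names, response column, and training H2OFrame) are already defined:

import h2o
import pandas as pd

# Model 1's coefficients as a dict {name: beta}; the intercept is excluded,
# since each beta_constraints row corresponds to a predictor.
coef = model1.coef()
names = [n for n in coef if n != "Intercept"]
m2_betas = [coef[n] for n in names]

constraints = pd.DataFrame({'names': names,
                            'lower_bounds': m2_betas,
                            'upper_bounds': m2_betas,
                            'beta_given': m2_betas})
bc = h2o.H2OFrame(constraints)  # H2O expects an H2OFrame here, not a pandas frame

model2 = H2OGeneralizedLinearEstimator(family="poisson", alpha=0, solver="L-BFGS",
                                       max_iterations=20, beta_constraints=bc)
model2.train(x=predictors, y=response, training_frame=train)
print(model2.coef())  # compare against the betas from the 40-iteration Model 3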

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with PySpark and SparkML at the moment. To do so, I use the Titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following the Spark documentation closely. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is multicollinearity among your variables. This means that one of the variables can be predicted as a linear combination of the others, and consequently the effect of each variable cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
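As a quick diagnostic (a rough sketch, assuming the training data is small enough to collect to the driver; training_df and the features column come from the question above), you can inspect the Gram matrix of the assembled features with numpy:

import numpy as np

# Pull the assembled feature vectors into a local numpy array.
X = np.array([row["features"].toArray()
              for row in training_df.select("features").collect()])

gram = X.T @ X                                  # proportional to the Hessian of a linear model
print(np.linalg.matrix_rank(gram), X.shape[1])  # rank below the column count => exact collinearity
print(np.linalg.cond(gram))                     # a huge condition number => nearly singular

If the rank comes up short, np.corrcoef(X, rowvar=False) can help locate the offending variables.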
The GeneralizedLinearRegressionModel docs note that a training summary may not be available for a model.
However, you can do an initial check to avoid the error: hasSummary, which in PySpark is a public boolean property (on the Scala side it is a method). Use it as:
if glm_fit.hasSummary:
    print(glm_fit.summary)
See the PySpark source code, and in particular the GeneralizedLinearRegressionTrainingSummary class, for where this error is thrown.
Make sure the input values for the one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I fed quarter values (1, 2, 3, 4) directly into the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
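For reference, a sketch of the usual fix using the Spark 3.x API (the column names quarter, quarter_idx, and quarter_vec are made up for the example): run the raw category through StringIndexer first, so the one-hot encoder sees 0-based indices.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# StringIndexer maps the raw quarter values (1-4) to 0-based indices,
# which is what OneHotEncoder expects.
indexer = StringIndexer(inputCol="quarter", outputCol="quarter_idx")
encoder = OneHotEncoder(inputCol="quarter_idx", outputCol="quarter_vec")
encoded = Pipeline(stages=[indexer, encoder]).fit(training_df).transform(training_df)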

Simulink S-Functions - Retrieve Initial Values from another S-Function

I'm trying to model the respective processes of an internal combustion engine. My current modelling approach is to have different sub functions which model the different processes.
Within each sub function is a Level 2 S-Function which solves the ODEs to give the in-cylinder state (pressure, temperature, etc.).
The problem I'm having is that each sub function is enabled depending on the current crank angle, which is computed from the current timestep in Simulink. The first process works fine because I manually set the initial values, but I then can't pass the latest in-cylinder state (the output from the first sub function) to the second sub function to use as its initial conditions (it insists on using the initial values I set at the beginning of the simulation).
Is there any way around this? Currently I'm going down the path of global data stores, but I haven't had any joy so far.
There are a lot of different ways to solve this problem; I'll show a few of them as examples.
You can feed an output back through a Unit Delay block. That way you get the value of your crank angle from the previous timestep and can use it in the formula when solving your equations.
Also, you can use some code like this:
if t == 0
    % equations using your initial values
    sred = 0;
else
    % equations using values carried over from the previous step
    y = uOld + myCoef;
end
Another idea: sometimes I use persistent variables in a MATLAB Function block to save the value of a variable from the previous step, but I think it makes the calculation slower.
One more idea: if you have Stateflow, you can create a chart with two states, the first for the initial moment with your coefficients and the second to solve the new equations.
If I've misunderstood you, show your code and we'll offer some new ideas!
P.S. An example of how I use an S-Function: my S-Function needs two values. Q is calculated in Simulink at every step; ro is an initial value I take from a big matrix loaded from the workspace into a table, picking the entry needed for the current time.
So there are no initial values inside the S-Function itself: all the values it needs are passed in from Simulink!

GMM in MATLAB gives different results for the same file

I constructed a Gaussian Mixture Model in Matlab with a dataset:
model = gmdistribution.fit(data,M,'Replicates',5);
with M = 3 Gaussian components. I tested new data with:
[P, l] = posterior(model,new_data);
I ran the program several times and didn't get the same result. Each run produces different log-likelihood values. I use the log-likelihood to make decisions, and this value for the same data (new_data) differs for each run. What does it depend on? How can I resolve this problem?
First, assuming that you're using a newish version of MATLAB, the gmdistribution.fit documentation indicates that the fit method is deprecated and that fitgmdist should be used instead (see its documentation for an example).
Second, the documentation for gmdistribution.fit indicates that if the 'Replicates' option is larger than 1, the 'randSample' start method will be used to produce the initial parameters. This may be the cause (or at least one of the causes) of your observed variability.
Finally, you can also try calling rng before gmdistribution.fit to set the seed of the global random number stream (assuming the function doesn't use its own stream internally). Alternatively, you can try specifying an 'Options' parameter via statset:
% Fix the random stream so each run starts from the same initial parameters
seed = 1;
s = RandStream('mt19937ar','Seed',seed);
opts = statset('Streams',s);
model = gmdistribution.fit(data,M,'Replicates',5,'Options',opts);
I can't test this fully myself – see the gmdistribution class documentation for further details.

Function parameters in MATLAB wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, which should be seen as something of a 'gold standard'; the program I'm trying to move to is MATLAB.
I tried to be smart and wrote some code (a lot, at least for me) whose goal was the following:
1. Take an individual longitudinal data file
2. Curve-fit this data to a non-parametric model using lsqcurvefit
3. Obtain figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters the two programs generate, there is a huge difference. KaleidaGraph stays close to its original starting values; MATLAB wanders off and sometimes ends up larger by a factor of 1000. The graphs, however, stay more or less the same in both cases, and both fit the data well. Still, it would be lovely to know how to make the MATLAB curve fit more 'conservative', staying nearer its original starting values.
validFitPersons = true(nbValidPersons,1);
for i = 1:nbValidPersons
    personalData = data{validPersons(i),3};
    personalData = personalData(personalData(:,1) >= minAge,:);
    % Fit a specific model for all valid persons
    try
        opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
        [personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
    catch
        x = 1;
    end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
    % y = a*(1 - 1/(1 + (b1*(t+e))^c1 + (b2*(t+e))^c2 + (b3*(t+e))^c3))
    heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).*(ages+params(8))).^params(5)+(params(3).*(ages+params(8))).^params(6)+(params(4).*(ages+params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
    % Define initial values
    if strcmpi('male',gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
I've tried to mimic the curve-fitting process in KaleidaGraph as closely as possible. I found that it uses the Levenberg-Marquardt algorithm, which I've selected, but the results still vary and I don't have any more clues about what to change.
Some extra context:
The idea behind this code is the following: I'm trying to compare different fitting models (they are designed for this purpose). So I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file. Since there are different models, it would be interesting if I could put restrictions on how far my starting values can wander off.
Has anyone any idea how this could be done? Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use lower and upper bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
This defines a set of lower and upper bounds on the design variables in x, so that the solution is always in the range lb ≤ x ≤ ub. Note, though, that the 'levenberg-marquardt' algorithm does not handle bound constraints, so you would need the default 'trust-region-reflective' algorithm to use them.
Cheers
You state:

"I'm trying to compare different fitting models (they are designed for this purpose). So I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file."
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem really lies with the model you have shown, which results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
Edit
To drive the point home regarding the problem with the choice of model, I ran a trial fit using simulated data and plotted the correlation matrix of the parameters in graphical form.
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
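For reference, here is a small numpy sketch (not from the original answer) of how such a parameter correlation matrix can be computed, assuming you export the Jacobian J and the residual vector from the fit (lsqcurvefit can return both as extra outputs):

import numpy as np

def param_correlation(J, resid):
    """Correlation matrix of the fitted parameters, from the Jacobian at the solution."""
    n, p = J.shape
    s2 = resid @ resid / (n - p)        # residual variance estimate
    cov = s2 * np.linalg.inv(J.T @ J)   # Gauss-Newton approximation to the parameter covariance
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)         # off-diagonal magnitudes near 1 flag redundant parameters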

Modelica: use of der() in a custom class/model

I'm trying to learn Modelica, and I have constructed a very simple model using the MultiBody library. The model consists of a world object and a body (mass) connected to two beams, which are in turn connected to two extended PartialOneFrame_a classes (see below) that I modified to create a constant force along one axis. Essentially, all this group of objects does is fall under gravity and spin around, because the two forces act at a longitudinal offset from the body centre and create a couple about the cg.
model Constant_Force
  extends Modelica.Mechanics.MultiBody.Interfaces.PartialOneFrame_a;
  parameter Real force = 1.0;
equation
  frame_a.f = {0.0, 0.0, force};
  frame_a.t = {0.0, 0.0, 0.0};
end Constant_Force;
I next wanted to see whether I could create a very simple aerodynamic force component to connect to the end of one of the rotating 'arms'. My idea was to follow the example of the Constant_Force model above and, as a first simple cut, generate forces based on the local frame velocities. This is where my problem arose. I tried to compute the velocity using der(frame_a.r_0), which I was then going to transform to the local frame using the resolve2 function, but adding the der(...) line caused the model to stop working properly: it would 'successfully' simulate (using OpenModelica), but the v11b vector (see below) was all zeros, as was der(frame_a.r_0) when plotted. Not only that, all the other component behaviours were now also constantly zero (frame_a.r_0, w_a, etc. of the body).
model Aerosurf
  extends Modelica.Mechanics.MultiBody.Interfaces.PartialOneFrame_a;
  import Modelica.Mechanics.MultiBody.Frames;
  import Modelica.SIunits;
  //Real[3] v11b;
  SIunits.Velocity v11b[3];
  //initial equation
  //  v11b = {0.0,0.0,0.0};
algorithm
  //v11b := der(frame_a.r_0);
equation
  v11b = der(frame_a.r_0);
  frame_a.f = {0.0, 0.0, 0.0};
  frame_a.t = {0.0, 0.0, 0.0};
end Aerosurf;
I tried a number of ways just to compute the velocities (you will see some of them in the commented lines) so I could plot them and check for correct behaviour, but to no avail. I did achieve some different (though also incorrect) behaviours with the algorithm versus the equation approach.
Any tips? I must be missing something fundamental here; it seems the frame connector does not inherently carry the velocity vector, so I have to compute it myself?
The simplest way is to use the Modelica.Mechanics.MultiBody.Sensors.AbsoluteVelocity block from the Modelica Standard Library: connect it to your MultiBody frame, then just use the variable of its output connector in your equation.