base_margin or init_score for catboost regressor - offset

I would like to use a CatBoost regressor for insurance applications (Poisson objective). Since I need to fix the exposure, how can I set an offset of log(exposure)? With XGBoost I use the "base_margin" parameter, while with LightGBM I use "init_score". Is there an equivalent in CatBoost?

Just use the "set_scale_and_bias(scale, bias)" method on your CatBoostRegressor model.
The bias parameter sets the offset of the model's predictions, while the scale parameter should be left at its default value of 1.
For your insurance Poisson objective the bias should be set to log(exposure).
See more details here: CatBoost documentation
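A minimal sketch of that call (assuming a constant exposure shared by all rows, since a single scalar bias cannot encode per-row offsets; the exposure value below is made up):
import numpy as np
from catboost import CatBoostRegressor

constant_exposure = 1.5  # illustrative value, e.g. policy-years shared by all rows

model = CatBoostRegressor(loss_function="Poisson", iterations=300)
model.set_scale_and_bias(1.0, float(np.log(constant_exposure)))
print(model.get_scale_and_bias())  # -> (1.0, log(constant_exposure))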

After looking at the documentation, I found a viable solution. The fit method of both CatBoostRegressor and CatBoostClassifier provides a baseline and a sample_weight parameter that can be used directly to set an offset (for prior exposure) or a sample weight (for severity modeling).
By the way, the cleaner approach is to create Pools and specify the offset and weights there:
from catboost import Pool

freq_train_pool = Pool(data=freq_train_ds, label=claim_nmb_train.values, cat_features=xvars_cat, baseline=claim_model_offset_train.values)
freq_valid_pool = Pool(data=freq_valid_ds, label=claim_nmb_valid.values, cat_features=xvars_cat, baseline=claim_model_offset_valid.values)
freq_test_pool = Pool(data=freq_test_ds, label=claim_nmb_test.values, cat_features=xvars_cat, baseline=claim_model_offset_test.values)
Here the data parameter contains a pd.DataFrame with the predictors only, label contains the actual number of claims, cat_features is a list of the categorical feature names, and baseline is the np.array of log(exposure). It works.
Using Pools also makes it easy to provide evaluation sets in the fit method.
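For completeness, a rough sketch of how the training step could look with these Pools (the hyperparameters are illustrative; how your CatBoost version treats the baseline at prediction time is worth double-checking in the docs):
from catboost import CatBoostRegressor

freq_model = CatBoostRegressor(loss_function="Poisson", iterations=500, learning_rate=0.05)
# The baseline (log exposure) attached to each Pool is used as the initial
# formula value for every row, i.e. a fixed offset during boosting.
freq_model.fit(freq_train_pool, eval_set=freq_valid_pool)
pred_test = freq_model.predict(freq_test_pool)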

Related

How to make the dynamic model in Dymola agree with the steady-state design result?

Modelica modeling is first-principles modeling, so knowing how to test a model and set an effective benchmark is important. For example, I can design a fluid network as I wish, but when building a dynamic simulation model I need to know the detailed geometry and parameters to set up every component. Usually I first build a steady-state model with simple energy and mass conservation laws, then design each piece of equipment based on the corresponding design manual. But when I put the dynamic components together and simulate until steady state, the result differs more or less from the steady-state model. So I am wondering whether I should modify my workflow to make the dynamic model agree with the steady-state model. Any suggestions are welcome.
#dymola #modelica
To my understanding of the question, your parameter values are fixed and physically known. I would attempt the following approach as a heuristic to identify the (few) components that need to be investigated carefully in order to understand how they influence or violate the assumed first principles.
This is just a first attempt and could be subject to further improvement and fine-tuning.
Consider the set of significant variables x_d(p, t) in R^n and the parameters p in R^m, where p includes only the additional parameters (including significant start values) that are not available in the steady-state model.
Denote the corresponding variables of the steady-state model by x_s.
Denote by t* a time point at which the dynamic model is numerically in (approximate) steady state.
Consider the function C(x_d(p, t*), x_s) = ||D||^2 with D = x_d(p, t*) - x_s.
It could be beneficial to treat C as a vector-valued function rather than a scalar one.
Compute the partial derivatives of C with respect to p, expressed in terms of dx_d/dp, i.e.
dC/dp = d[D^T D]/dp
      = d[(x_d - x_s)^T (x_d - x_s)]/dp
      = 2 (dx_d/dp)^T D,
since x_s does not depend on p.
Consider scaling the above quantity, i.e. (dC/dp) * (p/C) (avoiding the expected numerical issues via some epsilon tricks).
This gives a ranking of the parameters that contribute most to the apparent differences. The (hopefully few) components containing these parameters are likely the ones causing the violation.
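As a purely numerical illustration of this ranking (independent of Dymola; simulate_to_steady_state below is a hypothetical wrapper that runs the dynamic model with parameters p and returns x_d(p, t*) as an array), the scaled sensitivities can be approximated with finite differences:
import numpy as np

def cost(p, simulate_to_steady_state, x_s):
    # C = ||D||^2 with D = x_d(p, t*) - x_s
    d = simulate_to_steady_state(p) - x_s
    return float(d @ d)

def scaled_sensitivities(p, simulate_to_steady_state, x_s, rel_step=1e-3, eps=1e-12):
    c0 = cost(p, simulate_to_steady_state, x_s)
    ranking = np.zeros(len(p))
    for i in range(len(p)):
        dp = rel_step * max(abs(p[i]), eps)         # epsilon trick for near-zero parameters
        p_pert = np.array(p, dtype=float)
        p_pert[i] += dp
        dcdp = (cost(p_pert, simulate_to_steady_state, x_s) - c0) / dp
        ranking[i] = dcdp * p[i] / (c0 + eps)       # scaled sensitivity dC/dp * p/C
    return ranking

# Parameters with the largest |ranking[i]| point to the components that most
# explain the mismatch between the dynamic and steady-state results.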
If this still does not help, perhaps because of high correlation among the parameters, I would go further and set up a dummy parameter-identification problem, from which a more rigorous ranking of the significant model parameters can be obtained.
If the Modelica language had capabilities for expressing dynamic parameter sensitivities, all of the above could be carried out within a single Modelica model (with a slightly modified formulation).
For instance, if we had something like der(x,p) corresponding to dx/dp, one could simply state
dcdp = der(C,p)
An alternative approach is proposed via the DerXP library

Calculate number of parameters in neural network

I am wondering whether the number of parameters in models like ResNet18, VGG16, and DenseNet201 would change if we change the input size to the model.
I did measure the number of parameters with the following command
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
Also, I have tried the snippet below, and the number of parameters did not change for different input sizes:
import torchvision.models as models
from torchsummary import summary  # summary() comes from the torchsummary package

model = models.resnet18(pretrained=False)
model.cuda()
summary(model, (3, 64, 64))  # resnet18 expects 3 input channels
No, it would not. The parameters of a model exist to process the input as it propagates through the network.
The parameters are trained to serve that purpose, which is defined by the training task. Suppose the number of parameters did grow with the input size: what would the values of the new parameters be? Would they be random? How would these new parameters affect the inference of the model?
Such a sudden, random change to the fine-tuned, well-trained parameters of the model would be impractical. There may be algorithms I am unaware of that change their parameter set based on the input, but the architectures mentioned in the question do not support such functionality.
Trainable parameters do not change with the input size. If you look at the weights of the first layer with list(model.parameters())[0].shape, you will see that their shape does not depend on the height and width of the input, only on the number of input channels (e.g. grayscale, RGB, hyperspectral), which is usually an insignificant contribution in bigger models. For further information about getting the input shape, you can see this toy example.
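A quick self-contained check (assuming a recent torchvision; older versions use pretrained=False instead of weights=None) that the parameter count stays the same across input sizes:
import torch
import torchvision.models as models

model = models.resnet18(weights=None)   # no pretrained weights needed for counting
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

for size in (64, 128, 224):
    model(torch.randn(1, 3, size, size))                                  # different spatial sizes
    assert sum(p.numel() for p in model.parameters() if p.requires_grad) == n_params

print(n_params)                            # ~11.7 million for resnet18
print(list(model.parameters())[0].shape)   # torch.Size([64, 3, 7, 7]): depends on channels, not H/W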

ELKI clustering FDBSCAN algorithm

Could you please show me an example of an input file for FDBSCAN in ELKI? I get an error like this:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: UncertainObject,field
Available types: DBID DoubleVector,dim=2
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.clustering.uncertain.FDBSCANNeighborPredicate.instantiate(FDBSCANNeighborPredicate.java:131)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:122)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:79)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
FDBSCAN requires data of the type UncertainObject, i.e. objects with uncertainty information.
If you simply load a CSV file, the data will be certain, and you cannot use uncertain clustering.
There are several ways of modeling uncertainty. These are implemented as filters in the typeconversions package.
UncertainSplitFilter can split a vector of length k*N into k possible instances, each of length N, with uniform weight (see the toy data sketch after this list).
WeightedUncertainSplitFilter is similar, but each instance can also carry an associated weight.
UncertainifyFilter can simulate uncertainty by e.g. assuming a Gaussian or Uniform distribution around the original vector.
UniformUncertainifier (the U-Model, see Javadoc of UniformContinuousUncertainObject)
SimpleGaussianUncertainifier (see Javadoc of SimpleGaussianContinuousUncertainObject)
UnweightedDiscreteUncertainifier (BID Model, see Javadoc of WeightedDiscreteUncertainObject)
WeightedDiscreteUncertainifier (as above)
or add your own uncertainty information by extending the API!
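As a toy illustration of the first layout (purely assumed data with made-up numbers), one could generate an input file where each line stores k = 3 possible 2D positions of an object; UncertainSplitFilter can then split each line into k uncertain instances:
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0, 10, size=(50, 2))                            # 50 "true" 2D objects
samples = centers[:, None, :] + rng.normal(0, 0.3, size=(50, 3, 2))   # 3 noisy samples per object
np.savetxt("uncertain.csv", samples.reshape(50, 6), delimiter=" ")    # k*N = 6 values per line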

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text analysis model that classifies small text files as either Good, Bad, or Neutral. I was using a support-vector machine as my classifier. However, I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, but if the support for a text file is below 0.7, or some user-specified threshold, classify that text file as Neutral. I know this isn't regarded as the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1 and for the two class case they will sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
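Outside of RapidMiner, the same post-processing logic is roughly as follows (a pandas sketch; the confidence column names mirror the ones RapidMiner generates, everything else is illustrative):
import pandas as pd

def apply_threshold(scored: pd.DataFrame, min_confidence: float = 0.7) -> pd.Series:
    conf = scored[["confidence(Good)", "confidence(Bad)"]]
    prediction = conf.idxmax(axis=1).str.extract(r"confidence\((\w+)\)")[0]
    prediction[conf.max(axis=1) < min_confidence] = "Neutral"   # below threshold -> Neutral
    return prediction

scored = pd.DataFrame({"confidence(Good)": [0.9, 0.55], "confidence(Bad)": [0.1, 0.45]})
print(apply_threshold(scored).tolist())   # ['Good', 'Neutral']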

Mahout: Why is using setProbes() having this affect?

I'm using Mahout 0.7 to do some classification. I have an encoder for a continuous variable:
ContinuousValueEncoder durationPlanEncoder = new ContinuousValueEncoder("duration_plan");
The feature associated with this encoder is a number of days and can range from about 6 to 16.
I'm using an OnlineLogisticRegression model and I use the encoder to train it:
durationPlanEncoder.addToVector(null, <duration_plan double val>, trainDataVector);
For simplicity (since I'm trying to understand this whole classification thing while also learning Mahout), I am using two variables: 1) a categorical variable with 6 categories, one of which ("dev") always predicts the =1 category; and 2) this "duration_plan" variable.
What I expect to find is that, when I give the classifier test data consisting of the category "dev" and a "duration_plan" value, the accuracy of the classifier will increase as the "duration_plan" value gets closer to its average value across the training data. This is not what I'm seeing, however. Instead, the accuracy of the classifier improves as the value of "duration_plan" goes to 0.0. However, there are no training vectors with duration_plan=0.0! Why would this be the case?
Then I modified my durationPlanEncoder as follows:
durationPlanEncoder.setProbes(2);
and the accuracy improved. It got even better when I raised the number of probes to 20, then 200. Why? What is setProbes() doing, and is this an anomaly or is this actually how I should be doing it?
The final part of my question is that, even after setting setProbes(20), changing the value of "duration_plan" in the test data has no effect on the accuracy of the classifier, which I don't think is how it should be. If I give a value for duration_plan that doesn't even exist in any of the training data, and thus is never correlated with the =1 class, I would expect the classifier to classify the test sample as =0. Right? Which makes me think I must be coding something just plain wrong. Any suggestions are appreciated.
Mahout documentation is woefully sparse.
Thanks.