Can you implement an offset in h2o.gbm with a multinomial response (K > 2)?

My goal is to take the outcome class probability predictions from another model (or wherever, really) and use them as an offset in h2o.gbm with distribution = "multinomial".
I noticed that in the nnet package, the multinom() function allows an offset with as many columns as there are outcome classes (K). Does something like this exist for H2O GBMs?

No, offsets for multinomial GBM are currently not supported, mostly because of the API implications (offset_column would change semantics everywhere in the code), although it wouldn't be hard to implement otherwise.
The only option right now is to use the offset columns as additional predictors.
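To make concrete what a K-column offset would mean, here is a minimal sketch in plain Python. It is not H2O code, and the helper name softmax_with_offset is hypothetical. It assumes the usual multinomial formulation, in which a fixed offset is added to each class score before the softmax, so that log-probabilities from another model act as a starting point the new model adjusts.

```python
import math

def softmax_with_offset(scores, offset):
    """Class probabilities from per-class scores plus a fixed per-class offset."""
    shifted = [s + o for s, o in zip(scores, offset)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities predicted by some other model for one row (K = 3 classes).
prior = [0.7, 0.2, 0.1]
offset = [math.log(p) for p in prior]

# With all model scores at zero, the offset alone reproduces the prior.
baseline = softmax_with_offset([0.0, 0.0, 0.0], offset)

# A non-zero score for the last class shifts probability mass toward it.
adjusted = softmax_with_offset([0.0, 0.0, 1.0], offset)
```

Passing the K columns as ordinary predictors, as suggested above, does not impose this structure: the trees are free to use or ignore those columns.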

Related

I am using generalised mixed effects models with family = binomial to determine the difference between disturbed and undisturbed substrate

I am using a GLMM to determine differences in soil compaction across 3 locations and 2 seasons in undisturbed and disturbed sites. I used location and season as random effects. My teacher says to use the compaction reading divided by its upper bound as the Y value against the different sites (fixed effect). (I was previously using disturbed and undisturbed sites coded as 1 and 0 as Y against the compaction reading, so the opposite way around.) The random-effect variances are minimal. I was using both glmer and glmmPQL: glmer to obtain AIC and therefore the best model fit (which cannot be done in glmmPQL), and glmmPQL because it reports all the variance components, which glmer does not. The two give very similar results when disturbed/undisturbed is Y (and both match the graphs), but only glmmPQL agrees with the graphs when the proportion of the compaction reading is Y; glmer using proportions is totally different. Additionally, my teacher says I need to validate my model choice with a chi-squared value and, if the model is over-dispersed, use a quasi-binomial family. But I cannot find any way to do this in glmmPQL, and with glmer showing strange results using proportions as Y, I am unsure whether this is correct. I also cannot use quasi-binomial in either glmer or glmmPQL.
My response was the compaction reading, which is measured from 0 to 6 (kg per cm squared) inclusive. The explanatory variable was Type (different soils, either disturbed or not disturbed; there are 4 categories because some were artificially disturbed to pull out differences). All compaction readings were divided by 6 to make them a proportion, and so a continuous variable bounded by 0 and 1 but containing values of both 0 and 1. (I also tried the reverse: I coded disturbed as 1 and undisturbed as 0, compared these groups separately across all 4 Types, and left the compaction readings as they were.) Using glmer with code:
model1 <- glmer(comp/6 ~ Type + (1|Loc/Seas), data = mydata, family = "binomial")
model2 <- glmer(comp/6 ~ Type + (1|Loc), data = mydata, family = "binomial")
and using glmmPQL:
mod1 <- glmmPQL(comp/6 ~ Type, random = ~1|Loc, family = binomial, data = mydata)
mod2 <- glmmPQL(comp/6 ~ Type, random = ~1|Loc/Seas, family = binomial, data = mydata)
I could compare models in glmer but not in glmmPQL, but the latter gave me the variance for all random effects plus the residual variance, whereas glmer did not provide the residual variance (so I was left wondering what the total was and what proportion of it the random effects were responsible for).
When I used glmer this way, the results were completely different from glmmPQL: there was no significant difference at all in glmer but a very significant difference in glmmPQL. (However, if I do the reverse and code by disturbed and undisturbed, glmer and glmmPQL give similar results that agree with what the graphs suggest, but my supervisor said this is not strictly correct, e.g. mod3 <- glmmPQL(Status~compaction, random=~1|Loc/Seas, family = binomial, data=mydata), where Status is 1 or 0 according to disturbed or undisturbed. My supervisor would also like me to provide a chi-squared goodness of fit for the model chosen, so can I only use glmer here?) Additionally, the random effects variance is minimal, and glmer model selection removes the random effects as non-significant (although keeping one in gives a smaller AIC). Removing them (as suggested by the chi-squared test, but not by AIC) and running only a glm is consistent with both the glmmPQL results and what is observed on the graph. Sorry if this seems very pedantic, but I am trying to do what is correct for my supervisor and for the species I am researching. I know there are differences: they are seen and observed, eyeballing the data suggests so, and so do the graphs. Maybe I should just run the glm? Thank you for answering me. I will find some output to post.

base_margin or init_score for catboost regressor

I would like to use a CatBoost regressor for insurance applications (Poisson objective). As I need to fix the exposure, how can I set the offset of log_exposure? When using xgboost I use "base_margin", while for lightgbm I use the "init_score" params. Is there an equivalent in CatBoost?
Just use the set_scale_and_bias(scale, bias) method on your CatBoostRegressor model.
The bias parameter sets the offset of the model's predictions, while the scale parameter should be left at its default of 1.
For your insurance Poisson objective, the bias should be set to log(exposure).
See more details here: CatBoost documentation
After looking at the documentation, I found a viable solution. The fit method of both CatBoostRegressor and CatBoostClassifier provides a baseline and a sample_weight parameter that can be used directly to set an offset (for prior exposure) or a sample weight (for severity modeling).
Btw, the optimal approach is to create Pools and specify the offset and weights there:
freq_train_pool = Pool(data=freq_train_ds, label=claim_nmb_train.values, cat_features=xvars_cat, baseline=claim_model_offset_train.values)
freq_valid_pool = Pool(data=freq_valid_ds, label=claim_nmb_valid.values, cat_features=xvars_cat, baseline=claim_model_offset_valid.values)
freq_test_pool = Pool(data=freq_test_ds, label=claim_nmb_test.values, cat_features=xvars_cat, baseline=claim_model_offset_test.values)
Here the data parameters contain a pd.DataFrame with the predictors only, the label is the actual number of claims, cat_features is a list of strings naming the categorical columns, and the baseline terms are np.array values of log exposure. It works.
Using Pools also lets you provide evaluation sets in the fit method.
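Why log exposure is the right baseline for a Poisson frequency model can be sketched in plain Python. This makes no CatBoost calls, and predicted_claims is a hypothetical helper. It assumes the log link, under which the prediction is exp(baseline + raw model score), so the baseline turns a rate per unit of exposure into an expected count.

```python
import math

def predicted_claims(raw_score, exposure):
    """Expected claim count under a log link with log(exposure) as the offset."""
    baseline = math.log(exposure)
    return math.exp(baseline + raw_score)

# Assumed raw model output for one policy: a rate of 0.08 claims per year.
raw_score = math.log(0.08)

full_year = predicted_claims(raw_score, exposure=1.0)
half_year = predicted_claims(raw_score, exposure=0.5)
```

Halving the exposure halves the expected count while the learned rate stays the same. Because exposure differs from row to row, the offset has to be supplied per row, which is what the baseline argument of Pool does.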

Is H2O.ai's mini_batch_size really used?

The H2O documentation says:
mini_batch_size: Specify a value for the mini-batch size. (Smaller values lead to a better fit; larger values can speed up and generalize better.)
but when I run a model using the Flow UI (with mini_batch_size > 1), the log file says:
WARN: _mini_batch_size Only mini-batch size = 1 is supported right now.
So the question is: is mini_batch_size really used?
It appears to be a leftover from preparation for a DeepWater integration that never happened. E.g. https://github.com/h2oai/h2o-3/search?l=Java&p=2&q=mini_batch_size
That makes sense, because the Hogwild! algorithm that H2O's deep learning uses does away with the need for batching training data.
To sum up, I don't think it is used.

ELKI clustering FDBSCAN algorithm

Could you please show me an example of an input file for FDBSCAN in ELKI? I got an error like this:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: UncertainObject,field
Available types: DBID DoubleVector,dim=2
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.clustering.uncertain.FDBSCANNeighborPredicate.instantiate(FDBSCANNeighborPredicate.java:131)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:122)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:79)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
FDBSCAN requires data of the type UncertainObject, i.e. objects with uncertainty information.
If you simply load a CSV file, the data will be certain, and you cannot use uncertain clustering.
There are several ways of modeling uncertainty. These are implemented as filters in the typeconversions package.
UncertainSplitFilter can split a vector of length k*N into k possible instances, each of length N with uniform weight.
WeightedUncertainSplitFilter is similar, but every instance can also have a weight associated.
UncertainifyFilter can simulate uncertainty, e.g. by assuming a Gaussian or uniform distribution around the original vector. The available uncertainifiers are:
UniformUncertainifier (the U-Model, see Javadoc of UniformContinuousUncertainObject)
SimpleGaussianUncertainifier (see Javadoc of SimpleGaussianContinuousUncertainObject)
UnweightedDiscreteUncertainifier (BID Model, see Javadoc of WeightedDiscreteUncertainObject)
WeightedDiscreteUncertainifier (as above)
or add your own uncertainty information by extending the API!
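The splitting described for UncertainSplitFilter can be illustrated with a small plain-Python sketch. This is not ELKI code (ELKI is Java and applies the filter for you); the function name split_into_instances and the sample row are hypothetical, and the sketch only shows how one flat input row of length k*N maps to k equally weighted instances of length N.

```python
def split_into_instances(vector, dims):
    """Split a flat vector of length k*dims into k instances of length dims."""
    if dims <= 0 or len(vector) % dims != 0:
        raise ValueError("vector length must be a positive multiple of dims")
    k = len(vector) // dims
    instances = [vector[i * dims:(i + 1) * dims] for i in range(k)]
    weight = 1.0 / k  # uniform weight for every instance
    return [(inst, weight) for inst in instances]

# One input row holding three possible 2-D positions of the same object.
row = [1.0, 2.0, 1.1, 2.1, 0.9, 1.9]
samples = split_into_instances(row, dims=2)
```

So an input file for this filter would hold, on each line, all the possible positions of one object written one after another.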

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the no. of bits needed to encode information; the more uniform the distribution, the more the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
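The A1 and A2 cases above can be checked numerically with a minimal sketch. It assumes base-2 logarithms (which is what makes the 50/50 entropy equal to 1) and 50 examples per class; the function names and the counts are illustrative choices, not taken from any particular library.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [50, 50]  # 50 positive and 50 negative examples

# A1 is unrelated to the class: each branch is still a 50/50 mix.
gain_a1 = information_gain(parent, [[25, 25], [25, 25]])

# A2 separates the classes perfectly: each branch is pure.
gain_a2 = information_gain(parent, [[50, 0], [0, 50]])
```

Here gain_a1 is 0 and gain_a2 is 1, so a tree builder that picks the largest entropy reduction would split on A2 first.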
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example, "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.
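The classroom example can be put in numbers with a short sketch. It assumes each broad question splits the remaining candidates as evenly as possible and counts the worst case; the function name halving_questions is made up for illustration.

```python
def halving_questions(n):
    """Worst-case number of yes/no questions when each one halves the candidates."""
    questions = 0
    while n > 1:
        n = (n + 1) // 2  # the larger half remains in the worst case
        questions += 1
    return questions

students = 100
worst_case_individual = students - 1  # ask one student at a time
worst_case_halving = halving_questions(students)
```

With 100 students, halving needs at most 7 questions, against up to 99 when asking students one by one.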