Range of meaningful values for ranger importance using impurity_corrected

Using the ranger R package with the 'impurity_corrected' importance option, I get some negative importance values.
My understanding was that the importance values would range from 0 to 1, with 0 meaning an unimportant feature (i.e. one not adding any value).
Since I get negative importance values, what do the negative values mean?
Negatively correlated? Or worse than 0 (and I don't know what that would mean)?
Any reference papers / docs would be helpful.
I did check https://github.com/imbs-hl/ranger/blob/master/R/ranger.R and https://github.com/cran/ranger/blob/master/src/rangerCpp.cpp but could not really arrive at a conclusion.
The reference paper for this method, https://academic.oup.com/bioinformatics/article/34/21/3711/4994791, does not specify what negative values mean.
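For reference, a minimal sketch (in R; 'mydata' and the response 'y' are placeholders) of the setup described above and of how the negative values are commonly put to use:
library(ranger)
rf <- ranger(y ~ ., data = mydata, importance = "impurity_corrected")
importance(rf)                      # this vector can contain negative values
# The referenced Janitza et al. approach uses the non-positive values (together
# with their mirror image) as an empirical null distribution, from which
# p-values for the features are derived:
importance_pvalues(rf, method = "janitza")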

Related

Sentinel-2 images OFFSET: Should negative resulting OFFSET values be 0 (NODATA) or 1?

I’m processing Sentinel-2 images outside of SNAP and have a more general question about the new OFFSET values in Baseline 04.00 that I hope you may help me understand.
While comparing images between the current Baseline 04.00 and earlier Baseline versions I subtract the BOA_ADD_OFFSET value or RADIO_OFFSET_VALUE from the Baseline 04.00 images depending on product. This sometimes leads to, as expected, negative values in the resulting raster.
My question is: what is the correct way to represent these values when clamping them to a floor? Should they be regarded as NODATA (value 0), implying missing or faulty data?
Or should they be viewed as a minimum value (like 1)? The reasoning being that these pixels have a registered value that is out of bounds, but assumed to be approximately 0.
I’d be grateful if you could enlighten me on how best to reason about the negative values, or refer me to a source that better describes how the offset works, as the Sentinel-2 Products Specification Document doesn’t supply enough context for my understanding.
The question has now been answered by Jan on the ESA STEP Forum:
https://forum.step.esa.int/t/info-introduction-of-additional-radiometric-offset-in-pb04-00-products/35431
I would suggest that they should be regarded as NO DATA (value 0). This would at least flag them as unreliable. This is based on my understanding of negative reflectance values being linked to the Atmospheric Correction processing step, and thus they are an unwanted contribution to the output. So making them 0 identifies them as non-nominal, and helps serve as an avoidance measure, or a flag of doubt. Giving them a value of 1 would potentially attribute them some relevance, and may serve to confuse.
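As a minimal sketch (in R; the offset magnitude of 1000 and the example digital numbers are assumptions), the subtraction and the suggested NODATA clamping could look like:
offset <- 1000                   # magnitude of the PB04.00 radiometric offset (assumed)
NODATA <- 0
dn <- c(450, 1000, 1200, 5342)   # example digital numbers from a PB04.00 product
shifted <- dn - offset           # the subtraction described in the question
shifted[shifted < 0] <- NODATA   # regard negative results as NODATA (0)
shifted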

I am using generalised mixed effects models (family = binomial) to determine differences between disturbed and undisturbed substrate

I am using a GLMM to determine differences in soil compaction across 3 locations and 2 seasons in undisturbed and disturbed sites, with location and season as random effects. My teacher says to use the compaction reading divided by its upper bound as the Y value against the different sites (fixed effect). (I was previously doing the opposite: using disturbed vs. undisturbed sites coded as 1/0 as the Y against the compaction reading.) The random-effect variances are minimal. I was using both glmer and glmmPQL: glmer to obtain AIC and therefore the best-fitting model (which cannot be done with glmmPQL), while glmmPQL reports all of the variance components, which glmer does not. The two approaches give very similar outcomes when disturbed/undisturbed is the Y (and both match the graphs), but when the proportion of the compaction reading is the Y, only glmmPQL is consistent with the graphs; glmer with proportions is totally different. Additionally, my teacher says I need to validate my model choice with a chi-squared value and, if over-dispersed, use a quasi-binomial. But I cannot find any way to do this with glmmPQL, and with glmer showing strange results when using proportions as Y, I am unsure if this is correct. I also cannot use a quasi-binomial in either glmer or glmmPQL.
My response was the compaction reading, measured from 0 to 6 (kg per cm squared) inclusive. The explanatory variable was Type (different soils, either disturbed or not disturbed; 4 categories, because some were artificially disturbed to pull out differences). All compaction readings were divided by 6 to make them a proportion, i.e. a continuous variable bounded by 0 and 1 but containing values of both 0 and 1. (I also tried the reverse: coding disturbed as 1 and undisturbed as 0, comparing these groups separately across all Types (due to the 4 Types), and leaving the compaction readings as original.) Using glmer with code:
library(lme4)   # glmer
# proportion response (comp/6), nested random effects for location and season
model1 <- glmer(comp/6 ~ Type + (1 | Loc/Seas), data = mydata,
                family = "binomial")
model2 <- glmer(comp/6 ~ Type + (1 | Loc), data = mydata, family = "binomial")
and using glmmPQL:
library(MASS)   # glmmPQL
mod1 <- glmmPQL(comp/6 ~ Type, random = ~1 | Loc, family = binomial, data = mydata)
mod2 <- glmmPQL(comp/6 ~ Type, random = ~1 | Loc/Seas, family = binomial, data = mydata)
I could compare models with glmer but not with glmmPQL; however, the latter gave me the variance for all random effects plus the residual variance, whereas glmer did not provide the residual variance (so I was left wondering what the total was and what proportion of it these random effects were responsible for).
When I used glmer this way, the results were completely different from glmmPQL, in that there was no significant difference at all in glmer but very much a significant difference in glmmPQL. (However, if I do the reverse and code by disturbed and undisturbed, the two do provide similar results between glmer and glmmPQL, matching what is suggested by the graphs, but my supervisor said this is not strictly correct (e.g. mod3 <- glmmPQL(Status ~ compaction, random = ~1 | Loc/Seas, family = binomial, data = mydata), where Status is 1 or 0 according to disturbed or undisturbed). My supervisor would also like me to provide a chi-squared goodness of fit for the chosen model, so can I only use glmer here?) Additionally, the random-effect variances are minimal, and glmer model selection removes them as non-significant (although keeping one in gives a smaller AIC). Removing them (as suggested by the chi-squared test, but not by AIC) and running only a glm is consistent with both the results from glmmPQL and what is observed in the graphs. Sorry if this seems very pedantic, but I am trying to do what is correct for my supervisor and for the species I am researching. I know there are differences: they are seen and observed, and eyeballing the data and the graphs suggests so. Maybe I should just run the glm? Thank you for answering me. I will find some output to post.
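For reference, one common way to compute a chi-squared overdispersion check for a glmer fit is via the Pearson residuals. A minimal sketch, assuming 'model2' from above and that this is indeed the test the supervisor has in mind:
library(lme4)
overdisp_check <- function(fit) {
  rdf   <- df.residual(fit)                  # residual degrees of freedom
  rp    <- residuals(fit, type = "pearson")  # Pearson residuals
  X2    <- sum(rp^2)                         # chi-squared statistic
  ratio <- X2 / rdf                          # values well above 1 suggest overdispersion
  pval  <- pchisq(X2, df = rdf, lower.tail = FALSE)
  c(chisq = X2, ratio = ratio, df = rdf, p = pval)
}
overdisp_check(model2)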

How to check whether two results are comparable using the K-S test?

I have used two algorithms (methods) on a few datasets and obtained some results. Now I want to check whether the obtained results are comparable or not. I used the two-sample K-S test and got the following result; how should I interpret the test results, and what should the conclusion be?
KstestResult(statistic=0.11320754716981132, pvalue=0.8906908896753045)
D_alpha= 0.2641897545611759
The K-S test is used to compare distributions. So, when you apply it, you are interested in comparing a distribution, say F(X), with another distribution G(X). Typically, the former is your data distribution and the latter is either another data distribution or something you specify (e.g. Gaussian).
Now, the null hypothesis of the test is that F(X) = G(X). The test produces a statistic to assess whether this null hypothesis can be assumed to be true. You do not typically look directly at this quantity; rather, you look at the p-value. This is the probability of observing a statistic greater than the one you observed, if the null hypothesis were true. A low p-value indicates that it is very unlikely (i.e. low probability) to observe a statistic as large as the one you observed if the null hypothesis were true; thus, since you observed a statistic which gives you a small p-value, you tend to believe that the null hypothesis must be false, and that F(X) is indeed different from G(X).
In your case the p-value is high (by a low p-value we usually mean 0.1, 0.05, 0.01 or lower), so you do not reject the null hypothesis. Thus, you would say that the two sets of results you tested are not statistically different.
However, I strongly encourage you to read more about the test, how it is used and when, and to check whether it is appropriate in your case. You can do so on Wikipedia and in the scipy docs.
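A minimal sketch of the same two-sample test in R (the original output above came from scipy; the two result vectors here are placeholders):
results_a <- rnorm(50)                # results of method 1 (placeholder data)
results_b <- rnorm(60)                # results of method 2 (placeholder data)
ks <- ks.test(results_a, results_b)   # two-sample Kolmogorov-Smirnov test
ks$statistic                          # the D statistic (0.113 in the output above)
ks$p.value                            # if large (e.g. 0.89), do not reject F(X) = G(X)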

What is the actual meaning implied by information gain in data mining?

Information Gain = (Information before split) - (Information after split)
Information gain can be found with the above equation. But what I don't understand is what exactly this information gain means. Does it mean how much more information is gained, or how much uncertainty is reduced, by splitting on the given attribute, or something like that?
Link to the answer:
https://stackoverflow.com/a/1859910/740601
Information gain is the reduction in entropy achieved after splitting the data according to an attribute.
IG = Entropy(before split) - Entropy(after split).
See http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
Entropy is a measure of the uncertainty present. By splitting the data, we are trying to reduce the entropy in it and gain information about it.
We want to maximize the information gain by choosing the attribute and split point which reduces the entropy the most.
If entropy = 0, then there is no further information which can be gained from it.
Correctly written, it is
Information gain = entropy before split - (weighted) average entropy after split
The difference between entropy and information is the sign: entropy is high if you do not have much information about the data.
The intuition is that of statistical information theory. The rough idea is: how many bits per record do you need to encode the class label assignment? If you have only one class left, you need 0 bits per record. If you have a chaotic data set, you will need 1 bit for every record. And if the class is unbalanced, you could get away with less than that, using a (theoretical!) optimal compression scheme; e.g. by encoding the exceptions only. To match this intuition, you should be using the base 2 logarithm, of course.
A split is considered good, if the branches have lower entropy on average afterwards. Then you have gained information on the class label by splitting the data set. The IG value is the average number of bits of information you gained for predicting the class label.
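A small worked sketch (in R; the toy labels and split attribute are made up) of the quantities described above, with entropy measured in bits:
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p[p > 0] * log2(p[p > 0]))              # entropy in bits
}
info_gain <- function(labels, attribute) {
  branches <- split(labels, attribute)         # partition the data by the split attribute
  weights  <- sapply(branches, length) / length(labels)
  entropy(labels) - sum(weights * sapply(branches, entropy))
}
y <- c(1, 1, 1, 0, 0, 0, 1, 0)                  # class labels
x <- c("a", "a", "a", "a", "b", "b", "b", "b")  # candidate split attribute
info_gain(y, x)                                 # about 0.19 bits gained by splitting on x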

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function, driven by a high-frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time), looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would hold inverse masks (masks that select uninteresting numbers) for the easy-to-find large ranges of data not included in a given "switch statement", followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
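To make the two-stage idea above concrete, here is a minimal sketch (in R, illustrative only; a real implementation would live in a compiled language, and the table size of 2^12 is an arbitrary choice). A cheap first-stage table indexed by the low bits of the input rejects most uninteresting values; anything that survives (including values that merely collide with a key modulo 2^k) still has to be resolved by the exact range search, e.g. the one sketched at the end of this thread:
k <- 12                                    # first-stage table of 2^k entries
maybe <- rep(FALSE, 2^k)
keys <- c(2, 3, 8, 33:122, 10000)          # interesting integers, ranges expanded
maybe[bitwAnd(keys, 2^k - 1) + 1] <- TRUE  # mark buckets that may be interesting
quick_reject <- function(x) !maybe[bitwAnd(x, 2^k - 1) + 1]
quick_reject(c(7, 8, 500, 10000))          # -> TRUE FALSE TRUE FALSE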
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in a 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
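The binary-search alternative mentioned above is easier to sketch. A minimal illustration (in R; the boundaries encode the pseudo-code cases, and handler index 0 stands for "uninteresting"):
breaks  <- c(2, 3, 4, 8, 9, 33, 123, 10000, 10001)  # sorted boundaries of the cases
handler <- c(1, 2, 0, 3, 0, 4, 0, 40, 0)            # handler index for each interval
lookup <- function(x) {
  i <- findInterval(x, breaks)   # binary search over the boundaries, O(log n) per value
  c(0, handler)[i + 1]           # i = 0 means below the first boundary: no handler
}
lookup(c(1, 2, 50, 122, 123, 10000))   # -> 0 1 4 4 0 40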
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.