One of the features in my random forest model has missing values. There are 5 reasons why the data is missing, and I know the reason for every missing value. My question is: how can I feed this information into the model? I can create a categorical variable (or encoded dummies) for the reason the data is missing, but how can I make sure that the random forest gets information from this categorical variable when there is a missing value in my main variable?
Adding another variable will not help you much, because 1) Random Forest assumes independence of the variables, so you will not be able to entangle the two variables, and 2) there is no guarantee it will use the new variable at all.
If you want to use Random Forest, you will have to impute the missing values one way or another.
The simplest approach, if your variable lies in some range, is to set the missing values to out-of-range values that encode the reasons. That is, if your variable lies in the range [-1, 1], set the missing value to (say) -101 if the reason is reason #1, -102 for reason #2, and so on. The idea is to let the algorithm find distinct borders between the different values.
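A minimal sketch of this encoding in Python with scikit-learn (the data, the [-1, 1] range, and the sentinel values are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: x lies in [-1, 1]; np.nan marks missing entries, and
# reason[i] in {1..5} records the (known) reason a value is missing.
x = np.array([0.3, np.nan, -0.7, np.nan, 0.1, 0.9])
reason = np.array([0, 1, 0, 2, 0, 0])   # 0 = value is present
y = np.array([0, 1, 0, 1, 0, 1])

# Replace each missing value with a distinct out-of-range sentinel:
# -101 for reason #1, -102 for reason #2, and so on.
x_enc = np.where(np.isnan(x), -100 - reason, x)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x_enc.reshape(-1, 1), y)
```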
A second method, called MissForest, is more computationally complex. Since you don't know the value, information about why it is missing does not contribute much. Still, you can iteratively find the best value to substitute for the missing one.
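MissForest itself comes from R, but a similar iterative scheme can be sketched with scikit-learn's experimental IterativeImputer driven by a random forest (an approximation of the idea, not the exact MissForest algorithm):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

# Each feature with missing values is modelled as a function of the
# other features; the predictions are refined over several rounds.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```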
I would like to apply a clustering algorithm to my data frame; however, I have some nominally scaled variables. Consequently, I would like to apply one-hot encoding so that I can also use, for example, k-means clustering. I'm aware that there are other, and maybe better, algorithms than k-means, but I want to start with this and use the results as a benchmark.
There are several possibilities, e.g. the packages Caret and Recipes offer functions for this. However, these require the definition of a target variable, which then no longer appears in the data frame. Although I theoretically have a target variable in my data set, I would rather keep it as a predictor and overweight it, so that the different clusters each contain only one instance of the target variable. Consequently, I would need to select another variable and specify it as the target variable in the formula interface.
I would therefore like to ask whether it matters which variable one takes for this, or whether I actually have to take my real target variable and can still weight it somehow afterwards.
I've also seen a method where no target variable is defined in the formula interface. Is this a recommendable approach, or is it preferable to define a target variable?
I would be very happy about an answer!
Many greetings and thanks in advance!
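For what it's worth, one-hot encoding for clustering does not require a target variable at all. A hedged sketch of that idea in Python with pandas and scikit-learn (I can't speak to the exact Caret/Recipes calls; the column names are made up):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data frame; the column names are illustrative assumptions.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],   # nominal variable
    "size_cm": [10.0, 12.5, 9.8, 11.1],          # numeric variable
})

# One-hot encode the nominal columns only; no target variable needed.
X = pd.get_dummies(df, columns=["colour"])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

k-means then clusters on the dummy columns together with the numeric ones; whether and how to scale the dummies relative to the numeric variables is a separate design decision.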
This is more of a formatting problem than code logic and probably seems silly (considering I've seen far more dense block diagrams). I'm working with a lot of numeric constants and they're starting to clutter my Block Diagram. Is there something I can use to group them nice and compactly?
Preferably I would like to avoid clustering them because I would need to bundle and unbundle every time I needed access.
EDIT: Picture of code in question (code segment is used repeatedly, so would be nice to have a more compact case structure)
I think you should rethink how much of your block diagram you expect to devote to constants :-)
Using numbers directly in code, the equivalent of unlabelled constants on the LabVIEW block diagram, is a recognised anti-pattern. Unless the reason for the constant value is both obvious and fundamental to the operation being carried out, anyone looking at your code (including you, any time after a couple of weeks since you wrote it) will not understand why the value was chosen. Therefore, you should make this clear by labelling the constant somehow (equivalent to assigning it to a name in a text language) and also make it easy to change the value if necessary.
It's usually clear what a 0 or 1 constant is doing there, but in the code image you've posted you have two constants of 1000 and one of 999. Why is it 1000, and if I decide that it should be (say) 2000 instead, do I need to update the other two values as well? If so, you should define it once, label it with a suitable name describing what it is (in your example it might be chunk size or something), and wire that value to wherever you need to use it. Where you have a constant 999 you could get that value with a Decrement function, or you could change your Greater Than function to a Greater or Equal and compare directly with the 1000 value. This way your initial constant definition will take up more space because of the label, but you'll save space and improve maintainability by wiring that value to wherever you need it rather than placing additional constants.
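The same principle in a text language, as a rough sketch (CHUNK_SIZE is an assumption about what the 1000 in the posted image means):

```python
# Define the value once, with a name that explains why it exists.
CHUNK_SIZE = 1000  # assumed meaning of the literal 1000 in the image

def split_chunks(data):
    """Yield successive CHUNK_SIZE-sized slices of data."""
    for start in range(0, len(data), CHUNK_SIZE):
        yield data[start:start + CHUNK_SIZE]

# The old 999 is just CHUNK_SIZE - 1; comparing with >= CHUNK_SIZE
# removes the need for a second, slightly different literal.
for chunk in split_chunks(list(range(2500))):
    full = len(chunk) >= CHUNK_SIZE
    print(len(chunk), "full" if full else "tail")
```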
If you need to refer to the same constants in multiple places on your block diagram, you can place the constants (and just the constants, not any other program logic) in a subVI, with each constant wired to an indicator with a suitable label, and each indicator wired to a different output on the connector pane. When you hover the wiring tool over the SubVI's terminals you'll see the label in the tip-strip. Alternatively, especially if you need loads of different constant values, you can do the same thing but in your SubVI bundle the different constants into a named cluster (which you save as a typedef), and then use Unbundle by Name to access specific constant values from the cluster where you need them. Again this doesn't necessarily save block diagram space, but it does make your code more readable and maintainable.
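A loose text-language analogue of the typedef-cluster approach, assuming a frozen dataclass stands in for the named cluster (illustrative only; the field names are made up):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limits:
    """Analogue of a typedef'd cluster of named constants."""
    chunk_size: int = 1000
    timeout_ms: int = 250   # hypothetical second constant

LIMITS = Limits()
# "Unbundle by Name" becomes attribute access:
print(LIMITS.chunk_size)
```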
The simple answer was to reorganise my block diagram, making more space for the constants. Dave_St suggested creating subVIs for the case structures, for anyone looking for alternatives. Wanted to mark this as resolved regardless.
Why can arc4random_uniform be set as a constant? I noticed that in multiple examples.
arc4random_uniform generates a new value every time it is called, and I thought the value of a constant should never change? It looks much more like a variable.
This is confusing.
You can make a thing a constant. In this case, the thing is a function: you're setting up a constant reference to a function. Think of arc4random_uniform() as a random number factory. It is constant. It sits there, waiting to create random numbers, just like a car factory sits there, waiting to create cars.
When you call the random number factory, it gives you a new random number, but the thing itself, the random number factory, remains constant.
Edit:
If you had a variable reference to a function instead, then you could store different random number factories in that variable. Each one might have different performance characteristics (speed, true randomness of the results, range of results, etc.)
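The analogous idea in Python, where functions are first-class objects (Python has no true constants, so an uppercase name stands in for Swift's let; the values printed are of course arbitrary):

```python
import random

# RANDOM_FACTORY is a fixed reference to a function object: the
# "factory" itself never changes, even though each call to it
# produces a different number.
RANDOM_FACTORY = random.random

print(RANDOM_FACTORY())   # some value in [0, 1)
print(RANDOM_FACTORY())   # a different value; the reference is unchanged

# A *variable* reference could later point at a different factory:
factory = random.random
factory = lambda: random.randint(0, 9) / 10   # same interface, new factory
```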
The Matlab documentation seems unclear about how to ignore missing data when using kruskalwallis, the Kruskal-Wallis test (or any related test). The same goes for unequal group sizes.
Very late answer, but I ran into the same problem myself today, might as well help some future searcher.
The solution is pretty straightforward. kruskalwallis is primarily used on matrices and by default compares equal-sized columns, but it does allow you to assign groups manually with the optional argument "group". I was attempting to check whether a single value was unlikely to belong to the distribution of a different set, so this was straightforward: I just appended the value I wanted to test to the end of the set I was testing against, then made "group" a vector of ones the same size as the set, with a "2" added at the end for the new value. It worked quite nicely.
For numeric data, the standard missing data value in Matlab is NaN. See ismissing. See also this article from The MathWorks. For tables, you might find standardizeMissing helpful, as well as replaceWithMissing for dataset objects. I can't say anything about group size.
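A hedged sketch of the same workflow in Python with SciPy, since scipy.stats.kruskal accepts unequal group sizes and missing values can simply be dropped first (the data are made up):

```python
import numpy as np
from scipy import stats

# Two groups of unequal size, with NaN marking missing observations.
g1 = np.array([1.2, 3.4, np.nan, 2.2, 5.0])
g2 = np.array([2.1, np.nan, 4.4])

# Drop the missing values before testing; explicit removal keeps the
# resulting group sizes visible.
g1 = g1[~np.isnan(g1)]
g2 = g2[~np.isnan(g2)]

stat, p = stats.kruskal(g1, g2)
print(stat, p)
```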
I have vectors of data that I feed through the filter() function -- said filter was constructed to emit a reasonable approximation of the original signal that is then used to identify "bad" elements in the original data (said elements are typically caused by infrequent short-duration sensor malfunctions and are quite distinct from good data). After identifying these bad elements, I want to go back and replace them with something reasonable.
One approach would be to replace the bad values with the filtered output; however, the output was generated with the bad values, so it has some amount of undesired distortion.
Ideally, I'd like a way to tell filter() to assume that the bad element[s] are missing and that it should instead generate a reasonable interpolation of the missing value[s] (e.g., based on the surrounding values and the properties of the filter) for use when constructing the output.
I've been told that certain toolboxes allow insertion of special values (e.g., NaN) to indicate missing (but assumed to be well-behaved) data.
I looked at the source code for Octave's filter() and nothing obvious leapt out at me -- is there a special value (or other mechanism) to tell filter() to assume that well-behaved data is missing (and should be inserted as needed)?
Inserting NaN won't work for this. The filter function is pretty simple--it simply implements an IIR filter.
If your signal is smooth and slowly-changing, you might get away with simply using interp1 to interpolate new values for the bad stretches based on the good data on either side.
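A small sketch of that interpolation approach in Python (numpy.interp stands in for Matlab's interp1; the bad-sample mask is assumed to come from your filtering step):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.1, 9.9, 4.1, 5.0, 6.2])  # 9.9 is a bad sample
bad = np.array([False, False, False, True, False, False, False])

idx = np.arange(len(x))
# Interpolate replacement values for the bad samples from the good
# samples on either side (linear, like Matlab's interp1 default).
x_fixed = x.copy()
x_fixed[bad] = np.interp(idx[bad], idx[~bad], x[~bad])
```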
If your signal has more complicated spectral content, I think "Wiener interpolation" is the phrase to google for. For extrapolation you can use linear predictive coding.