The Matlab documentation seems unclear about how to ignore missing data when using kruskalwallis, the Kruskal-Wallis test (or any related test). The same goes for unequal group sizes.
Very late answer, but I ran into the same problem myself today, so I might as well help some future searcher.
The solution is pretty straightforward. kruskalwallis is primarily used on matrices and by default compares equal-sized columns, but it also lets you assign groups manually with the optional "group" argument. I was trying to check whether a single value was unlikely to belong to the distribution of a different set, so this was straightforward: I appended the value I wanted to test to the end of the set I was testing against, then made "group" a vector of ones the same size as the set, with a "2" added at the end for the new value. It worked quite nicely.
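For concreteness, here is a minimal sketch of that approach with made-up data (refSet and newValue are hypothetical names, not from the original post):

```matlab
% Minimal sketch: test whether one value belongs to a reference set.
refSet   = randn(100, 1);          % the set being tested against (made up)
newValue = 3.5;                    % the single value to test (made up)

x     = [refSet; newValue];        % append the value to the set
group = [ones(size(refSet)); 2];   % group 1 = set, group 2 = new value

p = kruskalwallis(x, group);       % unequal group sizes are handled here
```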
For numeric data, the standard missing data value in Matlab is NaN. See ismissing, and see also this article from The MathWorks. For tables, you might find standardizeMissing helpful, as well as replaceWithMissing for dataset objects. I can't say anything about group size.
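As a quick illustration with made-up numbers (per its documentation, kruskalwallis treats NaN as missing and ignores it):

```matlab
% Made-up matrix with NaN marking the missing entries.
x = [1.2  NaN  0.7;
     2.1  1.9  NaN;
     0.5  1.1  0.9];

any(ismissing(x(:)))   % true: the matrix contains missing values
p = kruskalwallis(x);  % the NaN entries are treated as missing and dropped
```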
One of the features in my random forest model has missing values. There are 5 reasons why the data is missing, and I know the reason for every missing value. My question is: how can I feed this information into the model? I can create a categorical variable (or encoded dummies) for the reason the data is missing, but how can I make sure the random forest gets information from this categorical variable when there is a missing value in my main variable?
Adding another variable will not help you much, because 1) Random Forest assumes independence of the variables, so you will not be able to entangle the two variables, and 2) it does not guarantee that it will use it at all.
If you want to use Random Forest, you will have to impute the missing values one way or another.
The simplest approach, if your variable is in some range, is to set the missing values to out-of-range values encoding the reasons. That is, if your variable lies in the range [-1, 1], set the missing value to (say) -101 if the reason is reason #1, -102 for reason #2, and so on. The idea is to let the algorithm find distinct borders between the different values.
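A minimal sketch of that encoding in Matlab, using TreeBagger as the random forest; all data and names here are made up:

```matlab
% Out-of-range encoding sketch; feature, reasonCode and y are hypothetical.
n          = 200;
feature    = 2*rand(n, 1) - 1;          % in-range values in [-1, 1]
reasonCode = randi(5, n, 1);            % reason #1..#5 for each row
missing    = rand(n, 1) < 0.1;          % ~10% of the values go missing

feature(missing) = -100 - reasonCode(missing);  % reason #1 -> -101, etc.

y     = categorical(rand(n, 1) > 0.5);  % dummy class labels
model = TreeBagger(50, feature, y, 'Method', 'classification');
```

The tree splits can then isolate each code (-101, -102, ...) exactly as they would any other distinct value.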
A second method, called MissForest, is a bit more computationally complex. Since you don't know the value, information about why it is missing does not contribute much. Still, you can iteratively find the best value to substitute for the missing one.
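Roughly, MissForest trains a random forest on the rows where a column is observed and predicts the rows where it is missing, cycling over the columns until the imputed values stabilise. A single pass for one column might look like this sketch (made-up data, column indices hardcoded for brevity):

```matlab
% One MissForest-style pass over column 2 of a made-up matrix X.
X = randn(100, 3);
X(rand(100, 1) < 0.1, 2) = NaN;        % punch holes in column 2

miss   = isnan(X(:, 2));
others = X(:, [1 3]);                  % the remaining columns as predictors

mdl = TreeBagger(50, others(~miss, :), X(~miss, 2), 'Method', 'regression');
X(miss, 2) = predict(mdl, others(miss, :));   % impute the gaps
```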
This is more of a formatting problem than a code-logic one, and it probably seems silly (considering I've seen far denser block diagrams). I'm working with a lot of numeric constants and they're starting to clutter my block diagram. Is there something I can use to group them nicely and compactly?
Preferably I would like to avoid clustering them because I would need to bundle and unbundle every time I needed access.
EDIT: Picture of code in question (code segment is used repeatedly, so would be nice to have a more compact case structure)
I think you should rethink how much of your block diagram you expect to devote to constants :-)
Using numbers directly in code, the equivalent of unlabelled constants on the LabVIEW block diagram, is a recognised anti-pattern. Unless the reason for the constant value is both obvious and fundamental to the operation being carried out, anyone looking at your code (including you, any time after a couple of weeks since you wrote it) will not understand why the value was chosen. Therefore, you should make this clear by labelling the constant somehow (equivalent to assigning it to a name in a text language) and also make it easy to change the value if necessary.
It's usually clear what a 0 or 1 constant is doing there, but in the code image you've posted you have two constants of 1000 and one of 999. Why is it 1000, and if I decide that it should be (say) 2000 instead, do I need to update the other two values as well? If so, you should define it once, label it with a suitable name describing what it is (in your example it might be chunk size or something), and wire that value to wherever you need to use it. Where you have a constant 999, you could get that value with a Decrement function, or you could change your Greater Than function to a Greater or Equal and compare directly with the 1000 value. This way your initial constant definition will take up more space because of the label, but you'll save space and improve maintainability by wiring that value to wherever you need it rather than placing additional constants.
If you need to refer to the same constants in multiple places on your block diagram, you can place the constants (and just the constants, not any other program logic) in a subVI, with each constant wired to an indicator with a suitable label, and each indicator wired to a different output on the connector pane. When you hover the wiring tool over the SubVI's terminals you'll see the label in the tip-strip. Alternatively, especially if you need loads of different constant values, you can do the same thing but in your SubVI bundle the different constants into a named cluster (which you save as a typedef), and then use Unbundle by Name to access specific constant values from the cluster where you need them. Again this doesn't necessarily save block diagram space, but it does make your code more readable and maintainable.
The simple answer was to reorganize my block diagram, making more space for the constants. For anyone looking for alternatives, Dave_St suggested creating subVIs for the case structures. I wanted to mark this as resolved regardless.
New to Matlab.
When I try to load my own data using the NN pattern recognition app window, I can load the source data but not the target (it never appears in the drop-down list). Both source and target are in the same directory. The source is 5000 observations with 400 variables per observation, and the target can take on 10 different values (recognizing digits). Any ideas?
Before you do anything with your own data, you might want to try the example data sets available in the toolbox. Because they definitely work, they make many problems easier to find later on: you can use them to see what's wrong with your code.
Regarding your actual question: without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. Some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool, your targets have to be a matrix with exactly one 1 (marking the correct class) in each row or column (depending on the orientation of your input matrix), so make sure that's the case.
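In other words, the targets are one-hot encoded. For a 10-class digit problem like yours, a sketch along these lines should produce the expected shape (ind2vec is part of the neural network toolbox; the labels here are made up):

```matlab
% Labels 1..10 become a one-hot target matrix, one column per sample.
labels  = [3 7 1 10];            % made-up class labels for four samples
targets = full(ind2vec(labels)); % 10x4 matrix with a single 1 per column
```

For your 5000-sample set that would give 400x5000 inputs and 10x5000 targets, with samples as columns.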
I have vectors of data that I feed through the filter() function. The filter was constructed to emit a reasonable approximation of the original signal, which is then used to identify "bad" elements in the original data (such elements are typically caused by infrequent, short-duration sensor malfunctions and are quite distinct from good data). After identifying these bad elements, I want to go back and replace them with something reasonable.
One approach would be to replace the bad values with the filtered output; however, the output was generated with the bad values, so it has some amount of undesired distortion.
Ideally, I'd like a way to tell filter() to assume that the bad element[s] are missing and that it should instead generate a reasonable interpolation of the missing value[s] (e.g., based on the surrounding values and the properties of the filter) for use when constructing the output.
I've been told that certain toolboxes allow insertion of special values (e.g., NaN) to indicate missing (but assumed to be well-behaved) data.
I looked at the source code for Octave's filter() and nothing obvious leapt out at me -- is there a special value (or other mechanism) to tell filter() to assume that well-behaved data is missing (and should be inserted as needed)?
Inserting NaN won't work for this. The filter function is pretty simple: it just implements an IIR filter.
If your signal is smooth and slowly-changing, you might get away with simply using interp1 to interpolate new values for the bad stretches based on the good data on either side.
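For the interp1 route, a minimal sketch, assuming the bad samples have already been flagged in a logical vector (x and bad are hypothetical names):

```matlab
% Gap-fill sketch: rebuild the flagged samples from the good ones.
t = (1:numel(x))';
x(bad) = interp1(t(~bad), x(~bad), t(bad), 'pchip');  % or 'linear'/'spline'
```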
If your signal has more complicated spectral content, I think "Wiener interpolation" is the phrase to google for. For extrapolation you can use linear predictive coding.
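A rough sketch of the linear-predictive idea (lpc is in the Signal Processing Toolbox; the data and the model order are arbitrary assumptions):

```matlab
% Extrapolate one sample past the end of a clean stretch of signal.
good = sin(0.1*(1:200))' + 0.01*randn(200, 1);  % made-up clean samples
a    = lpc(good, 8);                            % fit an 8th-order predictor
next = -a(2:end) * good(end:-1:end-7);          % one-step prediction
```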
I have written Matlab code for two different block matching algorithms, exhaustive search and three-step search, but I am not sure how I can check whether I am getting correct results. Is there any standard way to check these, or any standard code which I can run and compare my results with? I read somewhere that the JM software can be used, but I didn't find any way to use it.
You can always use the results produced by your algorithms to create the next frame of video and then analyze its quality, either by visually inspecting it (which is rather subjective, and we like to deal in numbers) or by calculating the mean square error between the produced image and the one you're trying to estimate. The mean square error of the exhaustive search should be lower than the one the three-step search gives you.
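A minimal sketch of that comparison, where predicted and actual are hypothetical same-size grayscale frames:

```matlab
% Mean square error between the motion-compensated frame and the real one.
residual = double(actual) - double(predicted);
mse      = mean(residual(:).^2);

% With the Image Processing Toolbox, immse computes the same quantity:
% mse = immse(double(actual), double(predicted));
```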
Well, did you try to plot it? I mean, after the block matching you have a new image, right?
A way to check whether your result is plausible is to compare the sums of absolute differences between frames:
A - pre_frame
B - post_frame
C - compensated frame
If sum(abs(A - C)) is lower than sum(abs(A - B)), the result could well be correct: the compensated frame C should be closer to A than the raw frame B is.
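A quick Matlab check along those lines, with A, B and C as defined above (assumed to be same-size frames):

```matlab
% Residual with and without motion compensation.
rawErr  = sum(abs(double(A(:)) - double(B(:))));  % plain frame difference
compErr = sum(abs(double(A(:)) - double(C(:))));  % after compensation

if compErr < rawErr
    disp('Compensation reduced the residual - result looks plausible.')
end
```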
Next time, try to specify your algorithm. Also, post your code here so we can help you more.