I cannot reproduce the results with kmeans in Orange

I've tried to reproduce the same results with the same flow, and I don't understand why the results are different each time.
To describe the situation: I have a file with 192 instances and 37 features. In all cases I select the same columns and preprocess by Median and StdDev. The flow computes the PCA with 7 principal components. The next step is to run the k-means algorithm (k between 2 and 8) on this 'new' dataset. The scatter plot shows the results for k=5.
I attached different images with my flows.
Image1: original flow
The first one is the original flow (highlighted in yellow), which I would like to repeat without the rest of the options (the second image).
Image2: flows repeated
However, when I tried to do it, I saw that the results are different (the third image). Of course the colors alone don't determine the differences, but the clusters themselves are different. In addition, the Silhouette Scores differ between the flows.
Image3: results of the different flows
K-means initializes with k-means++, and I would like to know whether I can "control" this, or whether the k-means initialization is always random. I saw that other programs have an option called seed, which makes an experiment repeatable, but I didn't see this option or anything similar here.
I wonder whether it is possible to always obtain the same results with the same flow (using k-means).

It seems the issue arises because no random seed is set in the k-means widget. The initialization is therefore different each time you repeat the experiment and, because of the nature of your data, the method converges to a different solution. Could you please report the issue to the Orange3 issue tracker?
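To illustrate why a seed matters, here is a minimal pure-Python sketch of k-means with k-means++ seeding. It is not Orange's implementation: the function names `kmeans_pp_init` and `kmeans`, the `seed` argument, and the toy 2-D data are all assumptions made for the example. The only point is that when the random generator is seeded, the initial centers, and therefore the final clusters, are identical between runs.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centers with k-means++ seeding, using the given RNG."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, seed=None, n_iter=50):
    """Plain Lloyd iterations; `seed` (hypothetical argument) fixes the initialization."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Toy data standing in for the PCA output (192 instances, 2 components here).
data_rng = random.Random(0)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(192)]

run_a = kmeans(data, k=5, seed=42)
run_b = kmeans(data, k=5, seed=42)
print(run_a == run_b)  # same seed, same data: identical centers
```

Outside the widget, scikit-learn's KMeans exposes a random_state parameter for the same purpose, so scripting the clustering step is one way to get repeatable results until a seed option is available in the flow.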

Related

Structuring an experiment with psychtoolbox

I have a design for an experiment that I would like to code in psychtoolbox using MATLAB. I understand the basics of how to use this toolbox however it would be extremely helpful if someone has designed a similar experiment before and could provide me with some code that could help me to carry out the following:
The experimental procedure will be composed of 80 trials divided into 5 blocks with 16 trials in each block. The experiment consists of the participant selecting a number from the screen. There is a desirable number (target number) and an associated less desirable number (lure number). I won't go into further detail about the reasoning behind this experiment as it is not relevant to my question.
I have attached an image that shows 1 block of trials (16 trials). The other 4 blocks are the same as this block.
Target and lure numbers will be presented on the screen to choose from (an example can be seen in image below).
In some of the trials as can be seen from the trials table only one target number and one lure number are presented for the participant to choose from (instead of two targets and two lures).
The lure(s) that appears with each target(s) should not always be the same. I want the lure that is shown with the target to be randomly selected in each trial (as can be seen in the trials image there is more than one possible lure). In the trials image I have attached the trial numbers are presented just for clarity, in each block the presentation of the targets needs to be randomized.

You can use the BalanceTrials function that comes with psychtoolbox. You use all the possible lures and targets as inputs and it returns a random order of all possible combinations. You can also specify a minimum length of the list it returns, but if there are more combinations it will make the list longer to make it balanced. Here is an example:
numberOfTrials = 80;
targetNumbers = {'3','4','5','6','7','4 5','4 6','4 7'};
lureNumbers = {'3','4','5','6','7','4 7'};
[targets, lures] = BalanceTrials(numberOfTrials, 1, targetNumbers, lureNumbers);
You can split this list into 5 blocks, or call the function separately for each block.
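For readers who want to see what the balancing does, here is a rough Python sketch of the same idea. It is not the Psychtoolbox code: `balance_trials` is a hypothetical helper that assumes the behaviour described above, namely that every target/lure combination is repeated a whole number of times until the requested minimum is reached, and that the list is then shuffled.

```python
import itertools
import math
import random


def balance_trials(min_trials, factors, rng):
    """Return a shuffled list of every factor-level combination, repeated the
    smallest whole number of times that reaches at least min_trials."""
    combos = list(itertools.product(*factors))
    reps = math.ceil(min_trials / len(combos))
    trials = combos * reps
    rng.shuffle(trials)
    return trials


targets = ['3', '4', '5', '6', '7', '4 5', '4 6', '4 7']
lures = ['3', '4', '5', '6', '7', '4 7']

trials = balance_trials(80, [targets, lures], random.Random(1))

# Split the balanced list into blocks of 16 trials each.
blocks = [trials[i:i + 16] for i in range(0, len(trials), 16)]
print(len(trials), len(blocks))
```

With 8 targets and 6 lures there are 48 combinations, so asking for at least 80 trials gives 96 under this assumption, which is 6 blocks of 16 rather than 5. If exactly 80 trials are required, the list would have to be trimmed (giving up perfect balance) or the factor levels chosen so that the number of combinations divides 80.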

AnyLogic Sensitivity analysis visualization

I am new to this and lost about how to create visualization for the AnyLogic sensitivity analysis. Here is the summary:
I have a dataset that captures the dependent and independent variables at the end of each simulation run (it captures just one pair at the end). I am trying to vary the independent variable to see its impact on the dependent variable. The resulting dataset is correct (when I copy and paste it into Excel), but the chart looks blank.
Also, output states that it completed 5 iterations but I specified 10 and the data shows that there were in fact 10 iterations.
There are no parameters for the chart data (see the screenshot), but I am guessing it is automatically populated at the end of the simulation based on the code (also copied below)? Otherwise, I cannot figure out what goes into the chart data (I tried manually entering variables/datasets, to no avail).
This is the code after each simulation runs:
Color color = lerpColor( (getCurrentIteration() - 1) / (double) (getMaximumIterations() - 1), blue, red );
chart0.addDataSet( root.died_friend, format( root.SocFriendBrave ), color, true, Chart.INTERPOLATION_LINEAR, 1, Chart.POINT_NONE );
I realize this is a very basic question, but I am lost and cannot get on the right path based on the help I found. Thank you.

Let's say you have 2 experiments: simulation (normal one) and sensitivity analysis (new one)
The only way for the chart to look blank is if your dataset died_friend is empty. This can happen for many reasons, but the reason I suspect it happens here, is because you fill this dataset with information at the end of the simulation run, which means that you are probably using the java actions of the simulation experiment.
The sensitivity experiment DOES NOT read what you write on the java actions of the simulation experiment, so that might be the problem.
If this is not the case, you need to check other possible reasons on why your dataset is empty when you run the sensitivity analysis.
Remember: the fact that your dataset has data in your simulation experiment, does not necessarily mean the dataset will not be empty on the sensitivity experiment.

Since the Sensitivity Analysis experiment is set up via the wizard, it would be better if you showed us how you'd set that up, rather than the resultant AnyLogic-generated code.
From what you've said here, it looks like you may be incorrectly trying to use datasets instead of scalar values. If there's one 'dependent variable' (model output of interest) then the 'independent variable' needs to be a model parameter (parameter of your top-level agent, normally Main, which the experiment will be varying).
So you should specify:
Varying the relevant parameter (looks like you wanted it to be 0, 0.25, 0.5, 0.75, 1 so min 0, max 1 with step size of 0.25)
Producing a chart for the scalar output you want. (If this was, say, a variable called outputValue, you'd use expression root.outputValue when setting up the chart in the wizard.)
Also you should only be getting more than 5 iterations if you set up your parameter variation criteria incorrectly.
(Dataset outputs in the experiment's charts are typically for where your model produces a time series as an output --- i.e., a dataset of sim time vs. value --- and you want the Sensitivity Analysis experiment to show charts with each run's time series as separate lines (i.e., time on the X-axis). Scalar charts are for where the X-axis is the varying parameter and the Y-axis the output of interest.)
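As a quick check on the iteration count, the arithmetic can be sketched in Python. This only mirrors the numbers discussed here (min 0, max 1, step 0.25) and the fraction passed to lerpColor in the question; the helper names are made up for the example, and it assumes the step divides the range evenly.

```python
def parameter_values(minimum, maximum, step):
    """Values visited when a parameter is varied from minimum to maximum in fixed steps.
    Assumes the step divides the range evenly."""
    count = int(round((maximum - minimum) / step)) + 1
    return [minimum + i * step for i in range(count)]


def color_fraction(iteration, max_iterations):
    """Fraction used in the question's lerpColor call (iteration is 1-based)."""
    return (iteration - 1) / (max_iterations - 1)


values = parameter_values(0.0, 1.0, 0.25)
fractions = [color_fraction(i, len(values)) for i in range(1, len(values) + 1)]

print(values)     # five parameter values, hence five iterations
print(fractions)  # runs from 0.0 (first iteration) to 1.0 (last iteration)
```

If the experiment really performs 10 runs while the maximum iteration count is 5, the same formula gives (10 - 1) / (5 - 1) = 2.25 for the last run, which is outside the 0 to 1 range the colour interpolation is meant to cover. That is another reason to sort out the variation criteria first.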

Boxplot is broken, only showing one line

My data centres around different treatments and how they impact the day of germination. (Image: dodgy boxplot data)
A while ago whilst making violin plots in R to show the distribution of when germination occurs according to treatment, I attempted to add a boxplot as a descriptive statistic and was met with only one line.
I contacted many people, who simply had no idea what the issue was. I used this same data in another violin plot as part of a bigger data collection with more treatments, including this one.
I found it odd but moved on. Now that I have come to perform stats tests in SPSS, I have the same problem, as imaged below. When I try a Mann-Whitney U test I am told "cannot compute" due to not having solely two variables. When I try a Kruskal-Wallis test I am met with the dodgy boxplot below, and I am told pairwise comparisons cannot be done due to fewer than 3 test fields (i.e. 2).
I am at an absolute loss, I have tried rewriting the data out, copying data labels with 'stratified' 'strat' 's' etc and I have no idea where the problem could lie, if anyone could give me any guidance this would be really appreciated!
Thank you

The dependent variable in question appears to have only values 1, 2, and 3 in the Stratified group. If there is at least one case with a value of 1, at least one case with a value of 3, but most values at 2, then a box plot like you're seeing would be expected. In SPSS, run the EXAMINE procedure (Analyze>Descriptive Statistics>Explore in the menus), specifying the same dependent variable and grouping variable, and asking for percentiles. The box plots should match what you're getting, and in the percentiles table you should see that Tukey's hinges show the same value of 2 for the 25th, 50th, and 75th percentiles.
Tukey's hinges are the basis for the box and the line in box plots. The line is at the median or 50th percentile, and the lower and upper box edges are at the 25th and 75th percentiles, respectively. When all three coincide, you get just a line instead of a box.
There are two types of outlying values identified in box plots in SPSS. Points greater than 1.5 box lengths below or above the box edges are outliers, marked with circles, and points greater than 3 box lengths below or above the box edges are extremes, marked with asterisks. Since the box length here is 0, anything at other values is automatically an extreme.
Pairwise comparisons following a Kruskal-Wallis test are available only when there are at least three groups, since with only two groups the overall or omnibus test has already compared the two groups. I'm not sure what the issue was when trying to run a Mann-Whitney test.
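To make the collapsed box concrete, here is a small Python sketch. The germination days are invented for illustration (one case at 1, one at 3, the rest at 2), and `tukey_hinges` is a hypothetical helper that uses the usual definition of the hinges as the medians of the lower and upper halves of the sorted data, with the overall median included in both halves when the count is odd.

```python
from statistics import median


def tukey_hinges(values):
    """Return (lower hinge, median, upper hinge) of the data."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half; includes the middle value when n is odd
    lower = median(data[:half])
    upper = median(data[n - half:])
    return lower, median(data), upper


# Hypothetical 'Stratified' group: mostly day 2, with one case each at day 1 and day 3.
germination_day = [1] + [2] * 10 + [3]

lower, mid, upper = tukey_hinges(germination_day)
box_length = upper - lower

# SPSS-style extremes: more than 3 box lengths beyond the box edges.
extremes = [v for v in germination_day
            if v < lower - 3 * box_length or v > upper + 3 * box_length]

print(lower, mid, upper, box_length, extremes)
```

All three values come out as 2, so the box has zero length and is drawn as a single line, and the cases at 1 and 3 are flagged as extremes because any distance is more than 3 times a box length of 0.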

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the proposed model in a CVPR paper (Deep Interactive Object Selection) in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated with positive clicks
5. Euclidean distance map associated with negative clicks (as follows):
To do so, I should fine tune the FCN-32s network using "object binary masks" as labels:
As you see, in the first conv layer I have 2 extra channels, so I did net surgery to use pretrained parameters for the first 3 channels and Xavier initialization for 2 extras.
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how will the extra channels of the first conv be learned? Are the gradients strong enough to reach the first conv layer during the training process?
What should be the kernel size of the "fc6"? should I keep 7? I saw in "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers, since I'm not doing class segmentation (to use 21 for 20 classes and the background), how should I change it? What about 2? (one for object and the other for the non-object (background) area)?
Should I change "crop" layer "offset" to 32 to have center crops?
In case of changing each of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values into zero-centered ( {-0.5,0.5} ) status, or it is OK to use them with the values in {0,1} ?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss, while I'm using "1" as the number of outputs for "score_fr" and "upscore" layers. If I use 2 for that, I guess it should be softmax.

I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river and weather flow data for a watershed at different source locations.
So the matrix is as so,
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.

Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, 20 entries of each row will be missing due to the 'valid' flag: with a 21-element kernel, the output has 3650 - 21 + 1 = 3630 columns. Only once the convolution kernel is completely contained inside a row of the matrix, i.e. from the 21st entry onward, do you get valid results (no pun intended). If you'd like to see the entries at the boundaries, use the 'same' flag to keep the output the same size as the input, or the 'full' flag (which is the default), which gives you the output starting from the most extreme outer edges. Bear in mind that the moving average there is computed with a bunch of padded zeroes, so the entries near the boundaries wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the 'valid' flag is what you want, but bear in mind that you will have 20 entries missing per row to accommodate the edge cases. All in all, your code should work, but be careful how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
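Here is a small pure-Python sketch of both points for a single row (it is not MATLAB and does not call convn). The helper names and the toy row are assumptions for illustration. The first function averages every window that fits entirely inside the row, and the second performs a direct convolution with the kernel explicitly flipped.

```python
def moving_average_valid(row, window):
    """Moving average over only those windows that fit entirely inside the row."""
    return [sum(row[i:i + window]) / window
            for i in range(len(row) - window + 1)]


def convolve_valid(row, kernel):
    """Direct 1-D convolution (kernel flipped), keeping only the fully overlapping part."""
    k = len(kernel)
    flipped = kernel[::-1]
    return [sum(x * w for x, w in zip(row[i:i + k], flipped))
            for i in range(len(row) - k + 1)]


window = 21
ker = [1.0 / window] * window

# Toy row of one year of daily values (hypothetical data).
row = [float(i % 7) for i in range(365)]

avg = moving_average_valid(row, window)
conv = convolve_valid(row, ker)

print(len(row), len(avg))  # the output is window - 1 entries shorter than the input
print(max(abs(a - b) for a, b in zip(avg, conv)))  # flipping a uniform kernel changes nothing
```

The same size rule applies along every dimension, which would explain why the second kernel in the question gives a different result: ker2 has 3 rows, so with 'valid' the output also loses 2 rows (150 - 3 + 1 = 148 sites), and because only the middle row of ker2 is non-zero, each output row corresponds to the site one row further down in the first result.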
Good luck!