deterministic assignment of symbols to groups with Matlab gscatter function

Suppose that a data set can be divided into three groups (e.g., negatives, zeroes, and positives) and that one desires to represent these groups in a plot using different symbols, assigning < to negatives, o to zeroes, and > to positives; e.g.,
gscatter([-1 -1 0 0 1 1],[1 2 1 2 1 2],[-1 -1 0 0 1 1],'k','<o>',10); xlim([-3 3]); ylim([0 3]);
Suppose further that the input data to the gscatter function lack representation from all groups. The group-symbol relationship could then change because, according to the Matlab documentation, gscatter sequentially assigns symbols from the provided list to groups based upon the sorted order of the unique values of the grouping variable. The upshot of this grouping algorithm is that absence of representation from earlier-sorting groups produces a shift in the symbol/group assignment, thereby destroying symbolic significance (the precise symbol assigned to a group may be immaterial, but this question focuses on those cases where a particular symbol must invariably be assigned to a particular group). For instance, for a data set lacking negative values, gscatter would assign the < symbol to zeroes and the o symbol to positives (the > symbol going unused because a third symbol is extraneous when only two distinct groups exist); e.g.,
gscatter([0 0 1 1],[1 2 1 2],[0 0 1 1],'k','<o>',10); xlim([-3 3]); ylim([0 3]);
My question is whether one can deterministically assign a symbol to a particular group in cases where there is a possibility of missing groups from a data set (e.g., can one mandate assignment of < to negative values even when no such values are present in the data set, to avoid shifting of the symbol/group relationship described above). The Matlab documentation seems to indicate that such an operation is impossible, meaning that one would have to rely on a series of if statements to determine whether certain groups are missing and to appropriately redefine restricted symbol sets for each possible combination of group representations, but I seek to know whether this limitation can be circumvented more elegantly.

In general, for "processing data that may or may not be present" problems, there's always the horrible cheat of forcing the necessary parts of the data to exist:
x = [-1 -1 0 0 1 1];
y = [1 2 1 2 1 2];
group = [-1 -1 0 0 1 1];
gscatter([x NaN NaN NaN], [y NaN NaN NaN], [group -1 0 1], 'k', '<o>', 10);
(if, unlike plot, gscatter simply ignores a group of only NaN - I don't have it available to actually test - you could just use any coordinates well outside the final axes range)
Another possibility is changing the process itself to ensure consistency regardless of the data:
% scatter sizes are marker areas in points^2, so 100 roughly matches gscatter's size 10
scatter(x(group == -1), y(group == -1), 100, 'k', '<');
hold on;
scatter(x(group == 0), y(group == 0), 100, 'k', 'o');
scatter(x(group == 1), y(group == 1), 100, 'k', '>');
hold off;
However, in this case explicitly checking the data and adjusting for what is present is almost certainly the nicest approach, given an appropriate Matlab idiom:
markers = '<o>';
midx = ismember([-1 0 1], group);
gscatter(x, y, group, 'k', markers(midx), 10);
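As a quick check of the idiom above, here is a minimal sketch for a data set with no negatives. The names allGroups, allMarkers, present and markersUsed are illustrative and not from the question. The table must list the groups in sorted order, since that is the order in which gscatter hands out symbols. The gscatter call itself is left as a comment because it needs the Statistics Toolbox.

```matlab
% Fixed group-to-marker table (illustrative names), groups in sorted order
allGroups  = [-1 0 1];
allMarkers = '<o>';

% Data set with no negative values
x     = [0 0 1 1];
y     = [1 2 1 2];
group = [0 0 1 1];

% Keep only the markers whose group actually occurs in the data
present     = ismember(allGroups, group);
markersUsed = allMarkers(present);
disp(markersUsed)   % prints o> : zeroes keep o, positives keep >

% gscatter(x, y, group, 'k', markersUsed, 10);   % requires the Statistics Toolbox
```

Because the marker list shrinks together with the set of groups present, each remaining group keeps the symbol it would have had with the full data set.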

Related

How to create an adjacency/joint probability matrix in matlab

From a binary matrix, I want to calculate a kind of adjacency/joint probability density matrix (I am not sure what to call it, so please feel free to rename it).
For example, I start with this matrix:
A = [1 1 0 1 1
1 0 0 1 1
0 0 0 1 0]
I want to produce this output:
Output = [1 4/5 2/5
4/5 1 3/5
2/5 3/5 1]
Basically, for each pair of rows, I want to calculate the proportion of positions where they agree (1 and 1, or 0 and 0). Each row always agrees with itself, hence the 1s along the diagonal. No matter how many columns j are added, the result is still 3x3, but an extra row i would make it 4x4.
I like to think of the rows i of A as persons and the columns j as questions, so the final output is a 3x3 (number of persons) matrix.
I am having some trouble with this in MATLAB. If you could please point me in the right direction, that would be fabulous.

So, you can do this in two parts.
bothOnes = A*A';
gives you a matrix showing how many 1s each pair of rows share, and
bothZeros = (1-A)*(1-A)';
gives you a matrix showing how many 0s each pair of rows share.
If you just add them up, you get how many elements they share of either type:
bothSame = A*A' + (1-A)*(1-A)';
Then just divide by the row length to get the desired fractional representation:
output = (A*A' + (1-A)*(1-A)') / size(A, 2);
That should get you there.
Note that this only works if A contains only 1's and 0's, but it can be adapted for other cases.
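To see the formula at work, here is a small sketch that applies it to the matrix from the question and cross-checks it against a brute-force row comparison. The variable check and the loop are only there for verification and are not part of the answer. For this matrix, rows 1 and 2 agree in 4 of 5 positions, rows 1 and 3 in 2 of 5, and rows 2 and 3 in 3 of 5.

```matlab
A = [1 1 0 1 1
     1 0 0 1 1
     0 0 0 1 0];

bothOnes  = A*A';            % number of shared 1s for each pair of rows
bothZeros = (1-A)*(1-A)';    % number of shared 0s for each pair of rows
output    = (bothOnes + bothZeros) / size(A, 2);

% Brute-force cross-check: compare every pair of rows element by element
n = size(A, 1);
check = zeros(n);
for i = 1:n
    for j = 1:n
        check(i, j) = mean(A(i, :) == A(j, :));
    end
end

disp(output)
disp(max(abs(output(:) - check(:))))   % difference between the two methods
```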

Here are some alternatives, assuming A can only contain 0 and 1:
If you have the Statistics Toolbox:
result = 1-squareform(pdist(A, 'hamming'));
Manual approach with implicit expansion:
result = mean(permute(A, [1 3 2])==permute(A, [3 1 2]), 3);
Using bitwise operations. This is a more esoteric approach, and is only valid if A has at most 53 columns, due to floating-point limitations:
t = bin2dec(char(A+'0')); % convert each row from binary to decimal
u = bitxor(t, t.'); % bitwise xor
v = mean(dec2bin(u, size(A,2))-'0', 2); % pad to size(A,2) bits so the mean covers every column
result = 1 - reshape(v, size(A,1), []); % reshape to obtain result

MATLAB genetic algorithm constraint (all variables can't be zero at the same time in a binary environment)

I'm using the MATLAB ga function for my optimization problem. In my problem, I have some decision variables that are integers (0 and 1: I specified the lower bound, upper bound, and IntCon for them) plus two continuous variables. In addition, the integer variables cannot all be zero at the same time, so at least one of them must be 1. How can I implement this constraint in MATLAB?

This is a mixed-integer optimization problem, and it can be solved using ga in MATLAB. As the documentation puts it, ga can solve problems when certain variables are integer-valued; your problem, with binary variables plus two continuous ones, fits this case.
With the IntCon argument, you can specify which variables are integers; for instance, IntCon=[1 3] means that your first and third variables are integers. To prevent both integer variables from being 0 at the same time, I think you can add some inequality constraints.
For instance look at the following example:
Let's say we want to find the optimum value for the Ackley function with 5 variables (i.e., in 5 dimensions), [x(1)...x(5)], and let's assume that the first and third variables, x(1) and x(3), are integers. We can write the following script:
nVar = 5;
lb = -5*ones(1,nVar); % define the lower bound
ub = 5*ones(1,nVar); % define the upper bound
rng(1,'twister') % for reproducibility
opts = optimoptions('ga','MaxStallGenerations',50,'FunctionTolerance',1e-3,'MaxGenerations',300);
[x,~,~] = ga(@ackleyfcn,nVar,[],[],[],[],lb,ub,[],[1 3],opts);
disp('solution:');disp(x)
On my machine, I get this solution:
solution:
0 -0.000000278963321 0 0.979067345808285 -0.000000280775000
It can be seen that x(1) and x(3) are integers and both are 0. Now suppose, as you mentioned, that they cannot both be 0 at the same time, so that if one is 0 the other must be 1. Here the bounds of the Ackley problem allow each variable to range over [-5, 5]; in your case, the lower and upper bounds should be 0 and 1 for the integer variables.
To prevent both variables from being 0, I can write the following linear inequality constraints:
% x(1) + x(3) >= 1
% x(1) >= 0
% x(3) >= 0
These inequalities should be written in form Ax <= b:
A = [-1 0 -1 0 0
-1 0 0 0 0
0 0 -1 0 0];
b = [-1
0
0];
Now if we run the optimization problem again we see the effect of the constraints on the output:
[x,~,~] = ga(@ackleyfcn,nVar,A,b,[],[],lb,ub,[],[1 3],opts);
disp('solution');disp(x)
solution
1.000000000000000 -0.000005031565831 0 -0.000011740569861 0.000008060759466
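For the case in the question, where the integer variables are bounded to 0 and 1, the non-negativity rows are already covered by the lower bounds, so a single row should suffice. The sketch below builds that row for an arbitrary set of binary indices and checks it on a few hand-picked points without calling ga. The names intIdx and isFeasible are illustrative, and the sample points are made up for the check.

```matlab
nVar   = 5;
intIdx = [1 3];          % indices of the binary variables (same list as IntCon)

% sum(x(intIdx)) >= 1  rewritten in the form A*x <= b
A = zeros(1, nVar);
A(intIdx) = -1;
b = -1;

% Helper that tests a candidate point against the constraint
isFeasible = @(x) all(A * x(:) <= b);

disp(isFeasible([0 0.2 0 0.7 0.1]))   % 0: both binaries are zero
disp(isFeasible([1 0.2 0 0.7 0.1]))   % 1: one binary is set
disp(isFeasible([1 0.2 1 0.7 0.1]))   % 1: both binaries are set
```

A and b built this way would go into the third and fourth arguments of ga, exactly as in the call above.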

Sparse matrix in matlab: set unrecorded elements to -1 instead of 0

I want to create a sparse matrix consisting mainly of -1, but also including some 0 and 1. This is part of a larger project, so it is important that I do not switch -1 with 0. By default, sparse(A) in Matlab keeps track of only non-zero elements. Is there a way to keep track of only non-(minus one) elements? For example, if
A =
-1 -1 -1 0
1 -1 -1 -1
Then
new_sparse(A) =
(1,4) = 0
(2,1) = 1
Thanks!

No, there is no way to override sparse to use different values. What you can do, though time and memory consuming, is to use accumarray:
x_ind; % I presume this to contain the column index of the number
y_ind; % I presume this to contain the row index of the number
value; % I presume this to contain the value (0 or 1)
new_mat = accumarray([y_ind x_ind],value,[],[],-1); % subscripts are given as [row column]
new_mat now will contain your prescribed 0 and 1 values, and has -1 on all other locations. You do not have to set the size argument (the third) since it will just create a matrix of max(y_ind) x max(x_ind) size if you put []. The fourth input argument, the function, can be empty as well, since each combination of x_ind and y_ind will contain only one value, thus the default, sum, is sufficient.
An example:
A = [0 1 ; -1 0];
x_ind = [1;2;2];
y_ind = [1;1;2];
value = [0;1;0];
new_mat = accumarray([y_ind x_ind],value,[],[],-1);
new_mat =
0 1
-1 0
A different method, which I'd prefer, is to simply add one to all values, so that 1 becomes 2, 0 becomes 1, and -1 becomes 0. You are then free to use sparse as usual. In the example this would give A = [1 2;0 1], and you can recover your original values with A-1.
Just as a note: sparse stores a row index and a value for each nonzero element, plus column pointers and some overhead. So unless well over half of your matrix is empty, sparse actually consumes more memory than a regular, full matrix.
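The add-one approach described above can be sketched on the matrix from the question as follows. The variable names S and B are illustrative; the only assumption is that every later use of the matrix subtracts the offset again.

```matlab
A = [-1 -1 -1  0
      1 -1 -1 -1];

S = sparse(A + 1);      % -1 -> 0 (not stored), 0 -> 1, 1 -> 2
disp(nnz(S))            % number of stored entries

B = full(S) - 1;        % undo the offset to get the original values back
disp(isequal(A, B))     % 1 if the round trip is lossless
```

Only the two entries that differ from -1 are stored, which is the behaviour the question asks for.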

Adding additional ones that surround other values of one in a vector in MATLAB

Given a vector of zeros and ones in MATLAB, where the ones represent an event in time, I would like to add additional ones before and after the existing ones in order to capture additional variation.
Example: I would like to turn [0;0;1;0;0] into [0;1*;1;1*;0] where 1* are newly added ones.

Assuming A to be the input column vector -
%// Find all neighbouring indices with a window of [-1 1]
%// around the positions/indices of the existing ones
neigh_idx = bsxfun(@plus,find(A),[-1 1])
%// Select the valid indices and set them in A to be ones as well
A(neigh_idx(neigh_idx>=1 & neigh_idx<=numel(A))) = 1
Or use imdilate from Image Processing Toolbox with a vector kernel of ones of length 3 -
A = imdilate(A,[1;1;1])

You can do it by convolving with [1 1 1] and setting to 1 all values greater than 0. This works for column or row vectors.
x = [0;0;1;0;0];
y = double(conv(x, [1 1 1],'same')>0)
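If a wider window were ever needed, the same convolution idea should extend by lengthening the kernel. This is a sketch under that assumption; the half-width parameter k is hypothetical and not part of the question, which only asks for one neighbour on each side (k = 1).

```matlab
x = [0;0;1;0;0;0;0];
k = 2;                                   % hypothetical half-width of the window

% A kernel of 2*k+1 ones marks every position within k steps of an existing one
y = double(conv(x, ones(2*k+1, 1), 'same') > 0);
disp(y.')   % 1 1 1 1 1 0 0
```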

Purely by logical indexing:
>> A = [0 1 1 0 0];
>> A([A(2:end) 0] == 1 | [0 A(1:end-1)] == 1) = 1;
>> disp(A);
A =
1 1 1 1 0
This probably merits an explanation. The fact that it's a 3 element local neighbourhood makes this easy. Essentially, take two portions of the input array:
Portion #1: A starting from the second element to the last element
Portion #2: A starting from the first element to the second-last element
We place the first portion into a new array and add 0 at the end of this array, and check to see which locations are equal to 1 in this new array. This essentially shifts the array A over to the left by 1. Whichever locations in this first portion are equal to 1, we set the corresponding locations in A to be 1. The same thing for the second portion where we are effectively shifting the array A over to the right by 1. To shift to the right by 1, we prepend a 0 at the beginning, then extract out the second portion of the array. Whichever locations in this second portion are equal to 1 are also set to 1.
At the end of this operation, you would essentially shift A to the left by 1 and save this as a separate array. Also, you would shift to the right by 1 and save this as another array. With these two, you simply overlap on top of the original to obtain the final result.
The benefit of this method over its predecessors in this post is that it doesn't require function calls of any kind (bsxfun, conv, imdilate etc.) and relies purely on indexing into arrays and using logical operators [1]. It also handles boundary conditions. As written it works on row vectors; for a column vector, the zeros must be concatenated vertically instead, e.g. [A(2:end); 0] and [0; A(1:end-1)].
Some more examples with boundary cases
>> A = [0 0 1 1 0];
>> A([A(2:end) 0] == 1 | [0 A(1:end-1)] == 1) = 1
A =
0 1 1 1 1
>> A = [0 0 0 0 1];
>> A([A(2:end) 0] == 1 | [0 A(1:end-1)] == 1) = 1
A =
0 0 0 1 1
>> A = [1 0 1 0 1];
>> A([A(2:end) 0] == 1 | [0 A(1:end-1)] == 1) = 1
A =
1 1 1 1 1
1: This post is dedicated to Troy Haskin, one who believes that almost any question (including this one) can be answered by logical indexing.
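The shift-and-overlap idea can also be written so that it does not depend on the orientation of the input. The following is a minimal sketch, not the original answer's code: it builds the mask from a row copy of A (the names v and mask are illustrative) and relies on logical indexing to write back into A in its original shape.

```matlab
A = [0;0;1;0;0];                 % column vector input

v    = (A(:).' == 1);            % row copy of A as a logical vector
mask = [v(2:end) false] | [false v(1:end-1)];   % right or left neighbour is 1
A(mask) = 1;                     % logical indexing keeps A's original shape

disp(A.')   % 0 1 1 1 0
```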

Finding contiguous points in an increasing range

I have a set of data points in a vector. For example,
[NaN, NaN, NaN, -1.5363, NaN -1.7664, -1.7475];
These data result from a code which selects 3 points within a specified range (specifically, -0.6 and 0.6). If three points from the column do not exist in this range, the range is incrementally expanded until three points are found. In the above example, the range was increased to -1.8 to 1.8. However, the data we are analyzing are erratic and have random peaks and troughs, leading to points which are non-contiguous being accepted into the range (element 3 is chosen to be valid, but not element 4).
What would be the best way to go about this? I already have a code to incrementally increase the range to find three points, I just need to modify it to not stop at any three points, but to increase the range until it finds three CONTIGUOUS points. If that were done for the above example, I would just evaluate slopes to remove the 3rd element (since between 3 and 4, the slope is negative).
Thanks.

Assuming your data as provided in the example is in the variable x, you can use isnan and findstr like so:
x = [NaN, NaN, NaN, -1.5363, NaN -1.7664, -1.7475, 123];
~isnan(x)
ans =
0 0 0 1 0 1 1 1
pos = findstr(~isnan(x), [1 1 1]);
The reason for using findstr like this is that we would like to find the sequence [1 1 1] within the logical array returned by isnan, and findstr will return the index of the positions in the input array where this sequence appears.
For your example data, this will return [], but if you change it to the data in the example I have given, it will return 6, and you can extract the contiguous region with x(pos:pos+2). You will have to be a bit careful about cases where there are more than 3 contiguous values (if there were 4, it would return [6 7]) and the cases where there is more than one contiguous region. If you don't need to do anything meaningful with these cases then just use pos(1).
If you want to extract the entirety of the first contiguous region whose length is greater than or equal to 3, you could do something like:
x = [NaN, NaN, NaN, -1.5363, NaN -1.7664, -1.7475, 123, 456, 789];
startPos = [];
stopPos = [];
pos = findstr(~isnan(x), [1 1 1]);
if ~isempty(pos)
    startPos = pos(1);
    stopPos = startPos + 2;
    % Find any cases where we have consecutive numbers in pos
    if length(pos) > 1 && any(diff(pos) == 1)
        % We have a contiguous section longer than 3 elements
        % Find the NaNs
        nans = find(isnan(x));
        % Find the first NaN after pos(1), or the index of the last element
        stopPos = nans(nans > startPos);
        if ~isempty(stopPos)
            stopPos = stopPos(1) - 1; % Don't want the NaN
        else
            stopPos = length(x);
        end
    end
end
x(startPos:stopPos)
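The same extraction can be done without findstr, which may be useful if that function is unavailable in your release. This is an alternative sketch rather than the method above: it locates the start and end of every run of non-NaN values with diff and picks the first run of length 3 or more. The names valid, runStart, runEnd, runLen and segment are illustrative.

```matlab
x = [NaN, NaN, NaN, -1.5363, NaN, -1.7664, -1.7475, 123, 456, 789];

valid = ~isnan(x(:).');              % row vector, true where a value exists
d = diff([0 valid 0]);               % +1 at each run start, -1 just past each run end
runStart = find(d == 1);
runEnd   = find(d == -1) - 1;
runLen   = runEnd - runStart + 1;

k = find(runLen >= 3, 1);            % first run with at least 3 contiguous points
if isempty(k)
    segment = [];                    % no such run: the range would need widening
else
    segment = x(runStart(k):runEnd(k));
end
disp(segment)
```

An empty segment is the signal to expand the range and try again, which fits the incremental search described in the question.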