cellfun randperm downsampling: keep track of indexes (matlab) - matlab

I have this snippet of code, downsampling data from a cell array based on the length of its shortest element.
sizeShortest = min(cellfun('size', data, 2));
f=#(x)(x(:,sort(getfield(randperm(size(x,2)),{1:sizeShortest}))));
dummy = cellfun(f, data, 'UniformOutput', false);
I would also like to keep track of the indexes of the elements of data saved in dummy (Basically what's the 1:sizeShortest values from the randperm call).
I couldn't find an answer so far...any help is greatly appreciated!

From our comments, you can simply take the code that is indexing your array and obtain your indices:
f=#(x) sort(getfield(randperm(size(x,2)),{1:sizeShortest}));
indices = cellfun(f, data, 'UniformOutput', false);
In essence, you'd run cellfun twice. Once for getting the data, and again for the indices. However, as you have cleverly stated in your comments, running randperm subsequent times will generate a new sequence of numbers. You can "cheat" by resetting the random generator seed before you call the next randperm call. What this does is that once you reset the seed, the random numbers that you generate should be reproducible. You can do this by the rng function. This accepts any integer that will act as a seed. Once you set this up, the numbers generated by any random generation function will follow a prescribed format that relies on this seed.
As an example:
rng(10)
C = randperm(10)
This gives:
C =
2 10 9 7 6 5 3 4 8 1
If you run those two statements again, you will get the same sequence stored in C. As such, set your random number generator to any integer you want. Run the first cellfun command... then after, reset to the same seed and run the second cellfun command.
Now to make this truly dynamic, you'll want to choose a random seed first. Store this number for later so that you can use this same seed for both cellfun calls. You could poll the current time of the day, and use the seconds to give you your custom seed.
In other words:
c = clock;
seedNum = c(5); %// c(5) is the number of seconds in the minute you're currently on.
rng(seedNum);
%// Do your first set of commands here...
rng(seedNum);
%// Do your second set of commands here...
If you want to reset it back to default after you're done, you can call rng('default'); What this will do is that it will reset the random number generation as if you restarted MATLAB. You don't have to, but I think it's good practice!
Good luck!

Related

behavior of matlab's rng() function in loop

With rng() before for loop Matlab generate one array of random number, within for loop another one. Both results are repeatable, so rng() seeds work. But I want to know the reason of such behavior.I was expecting the results to be same. I think actually it is not because of rng but for loop
Code exapmle?
for i = 1:2
rng(1,'philox');
disp(randn(2,1)); % 1st number is 0.0906, 2nd one is -0.7327
end
rng(1,'philox');
for i = 1:2
disp(randn(2,1)); % 1st number is 0.7565, 2nd one is -0.7096
end
Shouldn't the results be same? Isn't rng(1,..) storing same array of numbers for seed 1
Your code works as you expect and would describe, I believe. You are definitely not showing in your code example all the outputs. This is what I get when running it.
>> test
First case:
0.0906
-0.7327
0.0906
-0.7327
Second case:
0.0906
-0.7327
0.7565
-0.7096
In the first case, you reset the random number generator inside the loop, and there fore you get twice the same results (two numbers). In the second case, you only set it once, and therefore you get the same numbers as before, and then the second loop you get the next 2 corresponding random numbers produced by the algorithm. These algorithms, once started, will produce an infinite amount of different numbers, and they won't produce the same unless you explicitly call the restart of the algorithm with the seed, like you do in the first example.
All this is way less confusing if you call disp(randn(1)) inside the loop, instead of generating 2 numbers each time.

Encoding a binary vector in a suitable way in Matlab

The context and the problem below are only examples that can help to visualize the question.
Context: Let's say that I'm continously generating random binary vectors G with length 1x64 (whose values are either 0 or 1).
Problem: I don't want to check vectors that I've already checked, so I want to create a kind of table that can identify what vectors are already generated before.
So, how can I identify each vector in an optimized way?
My first idea was to convert the binary vectors into decimal numbers. Due to the maximum length of the vectors, I would need 2^64 = 1.8447e+19 numbers to encode them. That's huge, so I need an alternative.
I thought about using hexadecimal coding. In that case, if I'm not wrong, I would need nchoosek(16+16-1,16) = 300540195 elements, which is also huge.
So, there are better alternatives? For example, a kind of hash function that can identify that vectors without repeating values?
So you have 64 bit values (or vectors) and you need a data structure in order to efficiently check if a new value is already existing?
Hash sets or binary trees come to mind, depending on if ordering is important or not.
Matlab has a hash table in containers.Map.
Here is a example:
tic;
n = 1e5; % number of random elements
keys = uint64(rand(n, 1) * 2^64); % random uint64
% check and add key if not already existing (using a containers.Map)
map = containers.Map('KeyType', 'uint64', 'ValueType', 'logical');
for i = 1 : n
key = keys(i);
if ~isKey(map, key)
map(key) = true;
end
end
toc;
However, depending on why you really need that and when you really need to check, the Matlab function unique might also be something for you.
Just throwing out duplicates once at the end like:
tic;
unique_keys = unique(keys);
toc;
is in this example 300 times faster than checking every time.

Matlab vectorization of multiple embedded for loops

Suppose you have 5 vectors: v_1, v_2, v_3, v_4 and v_5. These vectors each contain a range of values from a minimum to a maximum. So for example:
v_1 = minimum_value:step:maximum_value;
Each of these vectors uses the same step size but has a different minimum and maximum value. Thus they are each of a different length.
A function F(v_1, v_2, v_3, v_4, v_5) is dependant on these vectors and can use any combination of the elements within them. (Apologies for the poor explanation). I am trying to find the maximum value of F and record the values which resulted in it. My current approach has been to use multiple embedded for loops as shown to work out the function for every combination of the vectors elements:
% Set the temp value to a small value
temp = 0;
% For every combination of the five vectors use the equation. If the result
% is greater than the one calculated previously, store it along with the values
% (postitions) of elements within the vectors
for a=1:length(v_1)
for b=1:length(v_2)
for c=1:length(v_3)
for d=1:length(v_4)
for e=1:length(v_5)
% The function is a combination of trigonometrics, summations,
% multiplications etc..
Result = F(v_1(a), v_2(b), v_3(c), v_4(d), v_5(e))
% If the value of Result is greater that the previous value,
% store it and record the values of 'a','b','c','d' and 'e'
if Result > temp;
temp = Result;
f = a;
g = b;
h = c;
i = d;
j = e;
end
end
end
end
end
end
This gets incredibly slow, for small step sizes. If there are around 100 elements in each vector the number of combinations is around 100*100*100*100*100. This is a problem as I need small step values to get a suitably converged answer.
I was wondering if it was possible to speed this up using Vectorization, or any other method. I was also looking at generating the combinations prior to the calculation but this seemed even slower than my current method. I haven't used Matlab for a long time but just looking at the number of embedded for loops makes me think that this can definitely be sped up. Thank you for the suggestions.
No matter how you generate your parameter combination, you will end up calling your function F 100^5 times. The easiest solution would be to use parfor instead in order to exploit multi-core calculation. If you do that, you should store the calculation results and find the maximum after the loop, because your current approach would not be thread-safe.
Having said that and not knowing anything about your actual problem, I would advise you to implement a more structured approach, like first finding a coarse solution with a bigger step size and narrowing it down successivley by reducing the min/max values of your parameter intervals. What you have currently is the absolute brute-force method which will never be very effective.

Create different `randperm` numbers in loops

Suppose that we have this structure:
for i=1:x1
Out = randperm(40);
Out_Final = %% divide 'Out' to 10 parts. and select these parts for some purposes
for j=1:x2
%% Process on `Out_Final`
end
end
I'm using outer loop (for i=1:x1) to repeat main process (for j=1:x2) loop and average between outputs to have more robust results. I want randperm doesn't result equal (or near equal) outputs. I want have different Output for this function as far as possible in every calling in (for i=1:x1) loop.
How can i do that in MATLAB R2014a?
The randomness algorithms used by randperm are very good. So, don't worry about that.
However, if you draw 10 random numbers from 1 to 10, you are likely to see some more frequently than others.
If you REALLY don't want this, you should probably not focus on randomly selecting the numbers, but on selecting the numbers in a way that they are nicely spread out througout their possible range. (This is a quite different problem to solve).
To address your comment:
The rng function allows you to create reproducible results, make sure to check doc rng for examples.
In your case it seems like you actually don't want to reset the rng each time, as that would lead to correlated random numbers.

MatLab Missing data handling in categorical data

I am trying to put my dataset into the MATLAB [ranked,weights] = relieff(X,Ylogical,10, 'categoricalx', 'on') function to rank the importance of my predictor features. The dataset<double n*m> has n observations and m discrete (i.e. categorical) features. It happens that each observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved, i.e. missing or null, predictor values in the dataset. (There is no corruption in the dataset, it is just incomplete.)
relieff() uses this function below to remove any rows that contain a NaN:
function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(#or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];
This is not ideal, especially for my case, since it leaves me with X=[] and Y=[] (i.e. no observations!)
In this case:
1) Would replacing all NaN's with a random value, e.g. 99999, help? By doing this, I am introducing a new feature state for all the predictor features so I guess it is not ideal.
2) or is replacing NaNs with the mode of the corresponding feature column vector (as below) statistically more sound? (I am not vectorising for clarity's sake)
function [matrixdata] = replaceNaNswithModes(matrixdata)
for i=1: size(matrixdata,2)
cv= matrixdata(:,i);
modevalue= mode(cv);
cv(find(isnan(cv))) = modevalue;
matrixdata(:,i) = cv;
end
3) Or any other sensible way that would make sense for "categorical" data?
P.S: This link gives possible ways to handle missing data.
I suggest to use a table instead of a matrix.
Then you have functions such as ismissing (for the entire table), and isundefined to deal with missing values for categorical variables.
T = array2table(matrix);
T = standardizeMissing(T); % NaN is standard for double but this
% can be useful for other data type
var1 = categorical(T.var1);
missing = isundefined(var1);
T = T(missing,:); % removes lines with NaN
matrix = table2array(T);
For a start both solutiona (1) and (2) do not help you handle your data more properly, since NaN is in fact a labelling that is handled appropriately by Matlab; warnings will be issued. What you should do is:
Handle the NaNs per case
Use try catch blocks
NaN is like a number, and there is nothing bad about it. Even is you divide by NaN matlab will treat it properly and give you a NaN.
If you still want to replace them, then you will need an assumption that holds. For example, if your data is engine speeds in a timeseries that have been input by the engine operator, but some time instances have not been specified then there are more than one ways to handle the NaN that will appear in the matrix.
Replace with 0s
Replace with the previous value
Replace with the next value
Replace with the average of the previous and the next value
and many more.
As you can see your problem is ill-posed, and depends on the predictor and the data source.
In case of categorical data, e.g. three categories {0,1,2} and supposing NaN occurs in Y.
for k=1:size(Y,2)
[ id ]=isnan(Y(:,k);
m(k)=median(Y(~id),k);
Y(id,k)=round(m(k));
end
I feel really bad that I had to write a for-loop but I cannot see any other way. As you can see I made a number of assumptions, by using median and round. You may want to use a threshold depending on you knowledge about the data.
I think the answer to this has been given by gd047 in dimension-reduction-in-categorical-data-with-missing-values:
I am going to look into this, if anyone has any other suggestions or particular MatLab implementations, it would be great to hear.
You can take a look at this page http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html the firs a1a, it says transforming categorical into binary. Could possibly work. (: