MATLAB find on multiple columns to multiple column result - matlab

Setup
I have an array of captured data. The data may be captured on just 1 device or up to a dozen devices, with each device being a column in the array. I have a prior statement which I execute on the array to then turn it into a logical array to find particular points of interest in the data. Due to the nature of the data, there are many 0's and only a few 1's. I need to return an array with the indices of the 1's so I can go back and capture the data between those points (see update below).
find is an obvious choice for a function - however, the result I need, needs to have 1 column for each device. Normally find will do a linear index regardless of the dimensions of the array.
The devices follow a pattern - but aren't exactly the same. So, complicating this is the fact that the number of 1's in each column is close to, but not guaranteed to be exactly the same depending on the exact timing the data capture is stopped (they are most often different from each other by 1 element, but could be different by more).
MATLAB CODE ATTEMPTS
Because of that difference, I can't use the following simple code:
for p = 1:np
indices( :, p ) = find( device.data.cross( :, p ) );
end
Notes:
np is the number of columns in the data = number of devices captured.
devices is a class representing the collection of devices
data is a TimeTable containing captured data on all the devices
cross is a column in the data TimeTable which contains the logical array
Even this simple code is inefficient and generates the Code Analyzer warning:
The variable 'indices' appears to change size on every loop
iteration (within a script). Consider preallocating for speed.
As expected, it doesn't work as I get an error similar to the following:
Unable to perform assignment because the size of the left side is
448-by-1 and the size of the right side is 449-by-1.
I know why I get this error - each column in an array in MATLAB has to have the same number of rows, so I can't make the assignment if the row size doesn't match. I need to pad the "short" columns somehow. In this case, repeating the last index will work for my later operations without causing an error.
I can't figure out a "good" way to do this. I can't pre-populate the array rows because I don't know how many rows there will be until I've done the find operation.
I can change the code as follows:
indices = [];
for p = 1:np
tempindices = find( devices.data.cross(:, p) );
sizediff = size( tempindices, 1 ) - size( indices, 1 );
if p > 1
if sizediff > 0
padding = repmat(indices(end, 1:(p - 1)), sizediff, 1);
indices = [indices; padding];
elseif sizediff < 0
padding = repmat(tempindices(end), abs(sizediff), 1);
tempindices = [tempindices; padding];
end
end
indices(:,p) = tempindices;
end
Note: padarray would have been useful here, but I don't have the Image Processing Toolbox so I cannot use it.
This code works, but it is very inefficient, it creates multiple otherwise unneeded variables in the workspace and generates multiple "appears to change size on every loop iteration" warnings in Code Analyzer. Is there a more efficient way to do this?
Update / Additional Information:
Some more context is needed for my issue. Given that devices.data.cross is a logical array, to just "pick" the data I want from other columns in my table (as I originally described my problem) I could select a column from devices.data.cross and pass that logical column as a subscript to get that data. I do that where it works. However, for some of the columns I need to select "chunks" of the data between the indices and that's where (I think) I need the indices. Or, at least I don't know of another way to do it.
Here is example code of how I use the indices:
for p = 1:np
for i = 2:num_indices
these_indices = indices(i-1, p):( indices(i, p) - 1 );
rmsvoltage = sqrt( mean( devices.data.voltage(these_indices).^2 ) );
end
end
This is just one routine I do on the "chunks" of data. I also have a couple of functions where these chunks of data are passed for processing.

When I understood your problem correctly, the code below should work. I'm using the approach that Cris Luengo suggested in a comment under your question.
Key element is [rowIdcs, colIdcs] = find( cros ); which gives you the subscripts of positions in cros having a value of one. Please find further comments inline.
% Create some data for testing
volt = randn(10,10);
cros = randi(10,10,10) > 9;
% Get rowIdcs and colIdcs, which have both a size of Nx1,
% with N denoting the number of ones in the mask.
% rowIdcs and colIdcs are the subscripts of the ones in the mask.
[rowIdcs, colIdcs] = find( cros );
% The number of chunks is equal to number N of ones found in the mask;
numChunks = numel( rowIdcs );
% Initilize a vector for the rms
rms = zeros( numChunks, 1 );
% Loop over the chunks
for k = 1 : numChunks
curRow = rowIdcs(k);
curCol = colIdcs(k);
% Get indices of range over neighbouring rows
chunkRowIdcs = curRow + [-1 0 1]; %i.e. these_indices in your example
% Remove indices that are out of range
chunkRowIdcs( chunkRowIdcs < 1 | chunkRowIdcs > size(volt,1) ) = [];
% Get voltages covered by chunk
chunkVoltages = volt( chunkRowIdcs, curCol );
% Get RMS over voltages
rms(k) = sqrt( mean( chunkVoltages(:).^2 ));
end

Related

MATLAB: vectors of different length

I want to create a MATLAB function to import data from files in another directory and fit them to a given model, but because the data need to be filtered (there's "thrash" data in different places in the files, eg. measurements of nothing before the analyzed motion starts).
So the vectors that contain the data used to fit end up having different lengths and so I can't return them in a matrix (eg. x in my function below). How can I solve this?
I have a lot of datafiles so I don't want to use a "manual" method. My function is below. All and suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
datafiles = dir(pattern);
path = fileparts(pattern);
p = NaN(length(datafiles));
y_func = [];
for i = 1:length(datafiles)
exist(strcat(path, '/', datafiles(i).name));
filename = datafiles(i).name;
data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
filedata = data.data/1e3;
xdata = filedata(:,xcol);
ydata = filedata(:,ycol);
filter = filedata(:,xcol) > xfilter(i);
x(i,:) = xdata(filter);
y(i,:) = ydata(filter);
y_c(i,:) = calib(y(i,:), p_calib);
error = #(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
p(i,:) = fminsearch(error, p_0);
y_func = [y_func; func(x(i,:), p(i,:))];
end
end
sample data: http://hastebin.com/mokocixeda.md
There are two strategies I can think of:
I would return the data in a vector of cells instead, where the individual cells store vectors of different lengths. You can access data the same way as arrays, but use curly braces: Say c{1}=[1 2 3], c{2}=[1 2 10 8 5] c{3} = [ ].
You can also filter the trash data upon reading a line, if that makes your vectors have the same length.
If memory is not an major issue, try filling up the vectors with distinct values, such as NaN or Inf - anything, that is not found in your measurements based on their physical context. You might need to identify the longest data-set before you allocate memory for your matrices (*). This way, you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea ... allocate memory based on the size of the largest file first. Fill it up with e.g. NaN's
matrix = zeros(length(datafiles), longest_file_line_number) .* NaN;
Then run your function. Determine the length of the longest consecutive set of data.
new_max = length(xdata(filter));
if new_max > old_max
old_max = new_max;
end
matrix(i, length(xdata(filter))) = xdata(filter);
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);

Matlab: Random sample with replacement

What is the best way to do random sample with replacement from dataset? I am using 316 * 34 as my dataset. I want to segment the data into three buckets but with replacement. Should I use the randperm because I need to make sure I keep the index intact where that index would be handy in identifying the label data. I am new to matlab I saw there are couple of random sample methods but they didn't look like its doing what I am looking for, its strange to think that something like doesn't exist in matlab, but I did the follwoing:
My issue is when I do this row_idx = round(rand(1)*316) sometimes I get zero, that leads to two questions
what should I do to avoid zeor?
What is the best way to do the random sample with replacement.
shuffle_X = X(randperm(size(X,1)),:);
lengthOf_shuffle_X = length(shuffle_X)
number_of_rows_per_bucket = round(lengthOf_shuffle_X / 3)
bucket_cell = cell(3,1)
bag_matrix = []
for k = 1:length(bucket_cell)
for i = 1:number_of_rows_per_bucket
row_idx = round(rand(1)*316)
bag_matrix(i,:) = shuffle_X(row_idx,:)
end
bucket_cell{k} = bag_matrix
end
I could do following:
if row_idx == 0
row_idx = round(rand(1)*316)
assuming random number will never give two zeros values in two consecutive rounds.
randi is a good way to get integer indices for sampling with replacement. Assuming you want to fill three buckets with an equal number of samples, then you can write
data = rand(316,34); %# create some dummy data
number_of_data = size(data,1);
number_of_rows_per_bucket = 50;
bucket_cell = cell(1,3);
idx = randi([1,number_of_data],[number_of_rows_per_bucket,3]);
for iBucket = 1:3
bucket_cell{iBucket} = data(idx(:,iBucket),:);
end
To the question: if you use randperm it will give you a draw order without replacement, since you can draw any item once.
If you use randi it draws you with replacement, that is you draw an item possibly many times.
If you want to "segment" a dataset, that usually means you split the dataset into three distinct sets. For that you use draw without replacement (you don't put the items back; use randperm). If you'd do it with replacement (using randi), it will be incredibly slow, since after some time the chance that you draw an item which you have not before is very low.
(Details in coupon collector ).
If you need a segmentation that is a split, you can just go over the elements and independently decide where to put it. (That is you choose a bucket for each item with replacement -- that is you put any chosen bucket back into the game.)
For that:
% if your data items are vectors say data = [1 1; 2 2; 3 3; 4 4]
num_data = length(data);
bucket_labels = randi(3,[1,num_data]); % draw a bucket label for each item, independently.
for i=1:3
bucket{i} = data(bucket_labels==i,:);
end
%if your data items are scalars say data = [1 2 3 4 5]
num_data = length(data);
bucket_labels = randi(3,[1,num_data]);
for i=1:3
bucket{i} = data(bucket_labels==i);
end
there we go.

Matlab: how to implement a dynamic vector

I am refering to an example like this
I have a function to analize the elements of a vector, 'input'. If these elements have a special property I store their values in a vector, 'output'.
The problem is that at the begging I don´t know the number of elements it will need to store in 'output'so I don´t know its size.
I have a loop, inside I go around the vector, 'input' through an index. When I consider special some element of this vector capture the values of 'input' and It be stored in a vector 'ouput' through a sentence like this:
For i=1:N %Where N denotes the number of elements of 'input'
...
output(j) = input(i);
...
end
The problem is that I get an Error if I don´t previously "declare" 'output'. I don´t like to "declare" 'output' before reach the loop as output = input, because it store values from input in which I am not interested and I should think some way to remove all values I stored it that don´t are relevant to me.
Does anyone illuminate me about this issue?
Thank you.
How complicated is the logic in the for loop?
If it's simple, something like this would work:
output = input ( logic==true )
Alternatively, if the logic is complicated and you're dealing with big vectors, I would preallocate a vector that stores whether to save an element or not. Here is some example code:
N = length(input); %Where N denotes the number of elements of 'input'
saveInput = zeros(1,N); % create a vector of 0s
for i=1:N
...
if (input meets criteria)
saveInput(i) = 1;
end
end
output = input( saveInput==1 ); %only save elements worth saving
The trivial solution is:
% if input(i) meets your conditions
output = [output; input(i)]
Though I don't know if this has good performance or not
If N is not too big so that it would cause you memory problems, you can pre-assign output to a vector of the same size as input, and remove all useless elements at the end of the loop.
output = NaN(N,1);
for i=1:N
...
output(i) = input(i);
...
end
output(isnan(output)) = [];
There are two alternatives
If output would be too big if it was assigned the size of N, or if you didn't know the upper limit of the size of output, you can do the following
lengthOutput = 100;
output = NaN(lengthOutput,1);
counter = 1;
for i=1:N
...
output(counter) = input(i);
counter = counter + 1;
if counter > lengthOutput
%# append output if necessary by doubling its size
output = [output;NaN(lengthOutput,1)];
lengthOutput = length(output);
end
end
%# remove unused entries
output(counter:end) = [];
Finally, if N is small, it is perfectly fine to call
output = [];
for i=1:N
...
output = [output;input(i)];
...
end
Note that performance degrades dramatically if N becomes large (say >1000).

MATLAB setting matrix values in an array

I'm trying to write some code to calculate a cumulative distribution function in matlab. When I try to actually put my results into an array it yells at me.
tempnum = ordered1(1);
k=2;
while(k<538)
count = 1;
while(ordered1(k)==tempnum)
count = count + 1;
k = k + 1;
end
if(ordered1(k)~=tempnum)
output = [output;[(count/537),tempnum]];
k = k + 1;
tempnum = ordered1(k);
end
end
The errors I'm getting look like this
??? Error using ==> vertcat
CAT arguments dimensions are not consistent.
Error in ==> lab8 at 1164
output = [output;[(count/537),tempnum]];
The line to add to the output matrice was given to me by my TA. He didn't teach us much syntax throughout the year so I'm not really sure what I'm doing wrong. Any help is greatly appreciated.
If you're building the matrix output from scratch, you should make sure it hasn't already been initialized to anything. To do this, you can set it to the empty matrix at the beginning of your code:
output = [];
Also, if you know how large output is going to be, your code will run more efficiently if you preallocate the array output and index into the array to assign values instead of appending values to it. In your case, output should have the same number of rows as there are unique values in the array ordered1, so you could use the function UNIQUE to preallocate output:
nRows = numel(unique(ordered1)); %# Get the number of unique values
output = zeros(nRows,2); %# Initialize output
You would then have to keep a separate counter (say r) to track which index into output you will be adding to next:
...
output(r,:) = [count/537 tempnum]; %# Overwrite a row in output
r = r+1; %# Increment the row index
...
Some additional advice...
Even if you solve the error you are getting, you're going to run into more with the code you have above:
I believe you are actually computing a probability density function (or PDF) with your code. In order to get the cumulative distribution function (or CDF), you have to perform a cumulative sum over the final values in the first column of output. You can do this with the function CUMSUM:
output(:,1) = cumsum(output(:,1));
Your loop will throw an error when it reaches the last element of ordered1. The value of k can become 538 in your inner while loop, which will then cause an error to be thrown when you try to access ordered1(k) anywhere. To get around this, you will have to add checks to the value of k at a number of points in your code. One such point is your inner while loop, which can be rewritten as:
while (k <= 537) && (ordered1(k) == tempnum)
count = count + 1;
k = k + 1;
end
This solution uses the short-circuit AND operator &&, which will first check if (k <= 537) is true or false. If it is false (i.e. k > 537), the second logical check is skipped since its result doesn't matter, and you avoid the error that would result from evaluating ordered1(k).
Bonus MATLAB coolness...
MATLAB has a lot of cool functions that can do a lot of the work for you. One such function is ACCUMARRAY. Your TA may want you to do things using loops like you have above, but you can actually reduce your whole code to just a few lines like so:
nValues = numel(ordered1); %# Get the number of values
p = accumarray(ordered1,ones(size(ordered1)))./nValues; %# Create a PDF
output = [cumsum(p) unique(ordered1)]; %# Create the CDF output

How to efficiently find correlation and discard points outside 3-sigma range in MATLAB?

I have a data file m.txt that looks something like this (with a lot more points):
286.842995
3.444398
3.707202
338.227797
3.597597
283.740414
3.514729
3.512116
3.744235
3.365461
3.384880
Some of the values (like 338.227797) are very different from the values I generally expect (smaller numbers).
So, I am thinking that
I will remove all the points that lie outside the 3-sigma range. How can I do that in MATLAB?
Also, the bigger problem is that this file has a separate file t.txt associated with it which stores the corresponding time values for these numbers. So, I'll have to remove the corresponding time values from the t.txt file also.
I am still learning MATLAB, and I know there would be some good way of doing this (better than storing indices of the elements that were removed from m.txt and then removing those elements from the t.txt file)
#Amro is close, but the FIND is unnecessary (look up logical subscripting) and you need to include the mean for a true +/-3 sigma range. I would go with the following:
%# load files
m = load('m.txt');
t = load('t.txt');
%# find values within range
z = 3;
meanM = mean(m);
sigmaM = std(m);
I = abs(m - meanM) <= z * sigmaM;
%# keep values within range
m = m(I);
t = t(I);
%# load files
m = load('m.txt');
t = load('t.txt');
%# find outliers indices
z = 3;
idx = find( abs(m-mean(m)) > z*std(m) );
%# remove them from both data and time values
m(idx) = [];
t(idx) = [];