What is the best way to do random sample with replacement from dataset? I am using 316 * 34 as my dataset. I want to segment the data into three buckets but with replacement. Should I use the randperm because I need to make sure I keep the index intact where that index would be handy in identifying the label data. I am new to matlab I saw there are couple of random sample methods but they didn't look like its doing what I am looking for, its strange to think that something like doesn't exist in matlab, but I did the follwoing:
My issue is when I do this row_idx = round(rand(1)*316) sometimes I get zero, that leads to two questions
what should I do to avoid zeor?
What is the best way to do the random sample with replacement.
shuffle_X = X(randperm(size(X,1)),:);
lengthOf_shuffle_X = length(shuffle_X)
number_of_rows_per_bucket = round(lengthOf_shuffle_X / 3)
bucket_cell = cell(3,1)
bag_matrix = []
for k = 1:length(bucket_cell)
for i = 1:number_of_rows_per_bucket
row_idx = round(rand(1)*316)
bag_matrix(i,:) = shuffle_X(row_idx,:)
end
bucket_cell{k} = bag_matrix
end
I could do following:
if row_idx == 0
row_idx = round(rand(1)*316)
assuming random number will never give two zeros values in two consecutive rounds.
randi is a good way to get integer indices for sampling with replacement. Assuming you want to fill three buckets with an equal number of samples, then you can write
data = rand(316,34); %# create some dummy data
number_of_data = size(data,1);
number_of_rows_per_bucket = 50;
bucket_cell = cell(1,3);
idx = randi([1,number_of_data],[number_of_rows_per_bucket,3]);
for iBucket = 1:3
bucket_cell{iBucket} = data(idx(:,iBucket),:);
end
To the question: if you use randperm it will give you a draw order without replacement, since you can draw any item once.
If you use randi it draws you with replacement, that is you draw an item possibly many times.
If you want to "segment" a dataset, that usually means you split the dataset into three distinct sets. For that you use draw without replacement (you don't put the items back; use randperm). If you'd do it with replacement (using randi), it will be incredibly slow, since after some time the chance that you draw an item which you have not before is very low.
(Details in coupon collector ).
If you need a segmentation that is a split, you can just go over the elements and independently decide where to put it. (That is you choose a bucket for each item with replacement -- that is you put any chosen bucket back into the game.)
For that:
% if your data items are vectors say data = [1 1; 2 2; 3 3; 4 4]
num_data = length(data);
bucket_labels = randi(3,[1,num_data]); % draw a bucket label for each item, independently.
for i=1:3
bucket{i} = data(bucket_labels==i,:);
end
%if your data items are scalars say data = [1 2 3 4 5]
num_data = length(data);
bucket_labels = randi(3,[1,num_data]);
for i=1:3
bucket{i} = data(bucket_labels==i);
end
there we go.
Related
In Matlab (R2021b) I am using some given function, which reads time-dependent values of several variables and returns them in a combined matrix together with a time vector. In the data matrix each column represents one vector of time-dependent values for one variable.
[data,time] = function_reading_data_of_several_values('filename');
For readability of the following code where the variables are further processed, I would like to store these columns in separate vector variables. I am doing it like that:
MomentX = data(1,:);
MomentY = data(2,:);
MomentZ = data(3,:);
ForceX = data(4,:);
ForceY = data(5,:);
ForceZ = data(6,:);
That is working. But is there some simpler (or shorter) way of assigning the column of the matrix to individual vectors? I am asking because in real program I have more than the 6 columns as in the example. Code is getting quite long. I was thinking of something similar to the line below, but that does not work:
[MomentX,MomentY,MomentZ,ForceX,ForceY,ForceZ] = data; %does not work
Do you have any idea? Thanks for help!
Update:
Thanks to the hint here in the group to use tables, a solution could be like this:
...
[data,time] = function_reading_data_of_several_values('filename');
% data in matrix. Each column representing a stime dependent variable
varNames = {'MomentX', 'MomentX',...}; % Names of columns
T=array2table(data','VariableNames',varNames); % Transform to Table
Stress = T.MomentX/W + T.ForceY/A %accesing table columns
...
This seems to work fine and readable to me.
Solution 1: In industrial solutions like dSpace, it is very common to do it in struct arrays:
mydata.X(1).time = [0.01 0.02 0.03 0.04];
mydata.Y(1).name = 'MomentX';
mydata.Y(1).data = [1 2 3 4];
mydata.Y(2).name = 'MomentY';
mydata.Y(2).data = [2 3 4 5];
Solution 2: It is also very common to create tables
See: https://de.mathworks.com/help/matlab/ref/table.html
As already commented, it is probably better to use a table instead of separate variables may not be a good idea. But if you want to, it can be done this way:
A = magic(6): % example 6-column matrix
A_cell = num2cell(A, 1); % separate columns in cells
[MomentX, MomentY, MomentZ, ForceX, ForceY, ForceZ] = A_cell{:};
This is almost the same as your
[MomentX,MomentY,MomentZ,ForceX,ForceY,ForceZ] = data; %does not work
except that the right-hand side needs to be a comma-separated list, which in this case is obtained from a cell array.
Setup
I have an array of captured data. The data may be captured on just 1 device or up to a dozen devices, with each device being a column in the array. I have a prior statement which I execute on the array to then turn it into a logical array to find particular points of interest in the data. Due to the nature of the data, there are many 0's and only a few 1's. I need to return an array with the indices of the 1's so I can go back and capture the data between those points (see update below).
find is an obvious choice for a function - however, the result I need, needs to have 1 column for each device. Normally find will do a linear index regardless of the dimensions of the array.
The devices follow a pattern - but aren't exactly the same. So, complicating this is the fact that the number of 1's in each column is close to, but not guaranteed to be exactly the same depending on the exact timing the data capture is stopped (they are most often different from each other by 1 element, but could be different by more).
MATLAB CODE ATTEMPTS
Because of that difference, I can't use the following simple code:
for p = 1:np
indices( :, p ) = find( device.data.cross( :, p ) );
end
Notes:
np is the number of columns in the data = number of devices captured.
devices is a class representing the collection of devices
data is a TimeTable containing captured data on all the devices
cross is a column in the data TimeTable which contains the logical array
Even this simple code is inefficient and generates the Code Analyzer warning:
The variable 'indices' appears to change size on every loop
iteration (within a script). Consider preallocating for speed.
As expected, it doesn't work as I get an error similar to the following:
Unable to perform assignment because the size of the left side is
448-by-1 and the size of the right side is 449-by-1.
I know why I get this error - each column in an array in MATLAB has to have the same number of rows, so I can't make the assignment if the row size doesn't match. I need to pad the "short" columns somehow. In this case, repeating the last index will work for my later operations without causing an error.
I can't figure out a "good" way to do this. I can't pre-populate the array rows because I don't know how many rows there will be until I've done the find operation.
I can change the code as follows:
indices = [];
for p = 1:np
tempindices = find( devices.data.cross(:, p) );
sizediff = size( tempindices, 1 ) - size( indices, 1 );
if p > 1
if sizediff > 0
padding = repmat(indices(end, 1:(p - 1)), sizediff, 1);
indices = [indices; padding];
elseif sizediff < 0
padding = repmat(tempindices(end), abs(sizediff), 1);
tempindices = [tempindices; padding];
end
end
indices(:,p) = tempindices;
end
Note: padarray would have been useful here, but I don't have the Image Processing Toolbox so I cannot use it.
This code works, but it is very inefficient, it creates multiple otherwise unneeded variables in the workspace and generates multiple "appears to change size on every loop iteration" warnings in Code Analyzer. Is there a more efficient way to do this?
Update / Additional Information:
Some more context is needed for my issue. Given that devices.data.cross is a logical array, to just "pick" the data I want from other columns in my table (as I originally described my problem) I could select a column from devices.data.cross and pass that logical column as a subscript to get that data. I do that where it works. However, for some of the columns I need to select "chunks" of the data between the indices and that's where (I think) I need the indices. Or, at least I don't know of another way to do it.
Here is example code of how I use the indices:
for p = 1:np
for i = 2:num_indices
these_indices = indices(i-1, p):( indices(i, p) - 1 );
rmsvoltage = sqrt( mean( devices.data.voltage(these_indices).^2 ) );
end
end
This is just one routine I do on the "chunks" of data. I also have a couple of functions where these chunks of data are passed for processing.
When I understood your problem correctly, the code below should work. I'm using the approach that Cris Luengo suggested in a comment under your question.
Key element is [rowIdcs, colIdcs] = find( cros ); which gives you the subscripts of positions in cros having a value of one. Please find further comments inline.
% Create some data for testing
volt = randn(10,10);
cros = randi(10,10,10) > 9;
% Get rowIdcs and colIdcs, which have both a size of Nx1,
% with N denoting the number of ones in the mask.
% rowIdcs and colIdcs are the subscripts of the ones in the mask.
[rowIdcs, colIdcs] = find( cros );
% The number of chunks is equal to number N of ones found in the mask;
numChunks = numel( rowIdcs );
% Initilize a vector for the rms
rms = zeros( numChunks, 1 );
% Loop over the chunks
for k = 1 : numChunks
curRow = rowIdcs(k);
curCol = colIdcs(k);
% Get indices of range over neighbouring rows
chunkRowIdcs = curRow + [-1 0 1]; %i.e. these_indices in your example
% Remove indices that are out of range
chunkRowIdcs( chunkRowIdcs < 1 | chunkRowIdcs > size(volt,1) ) = [];
% Get voltages covered by chunk
chunkVoltages = volt( chunkRowIdcs, curCol );
% Get RMS over voltages
rms(k) = sqrt( mean( chunkVoltages(:).^2 ));
end
This question already has answers here:
Get the indices of the n largest elements in a matrix
(4 answers)
Closed 6 years ago.
When using a binary image with several lines I know that this code displays the longest line:
lineStats = regionprops(imsk, {'Area','PixelIdxList'});
[length, index] = max([lineStats.Area]);
longestLine = zeros(size(imsk));
longestLine(lineStats(index).PixelIdxList)=1;
figure
imshow(longestLine)
Is there a way to display the second longest line? I need to display a line that is a little shorter than the longest line in order to connect them.
EDIT: Is there a way to display both lines on the binary image figure?
Thank you.
I would set the longest line to zero and use max again, after I copy the original vector.
lineStats = regionprops(imsk, {'Area','PixelIdxList'});
[length, index] = max([lineStats.Area]);
lineAreas = [lineStats.Area]; %copy all lineStats.Area values into a new vector
lineAreas(index) = NaN; %remove the longest line by setting it to not-a-number
[length2, index2] = max(lineAreas);
EDIT: Response to new question
sort may be a more straight forward approach for multiples, but you can still use max.
lineAreas = [lineStats.Area]; %copy all lineStats.Area values into a new vector
% add a for loop that iteratively stores the desired indices
nLines = 3;
index = zeros(1,nLines);
for iLines = 1:nLines
[length, index(iLines)] = max(lineAreas);
lineAreas(index) = NaN; %remove the longest line by setting it to not-a-number
end
longestLine = zeros(size(imsk));
% I cannot be certain this will work since your example is not reproducible
longestLine([lineStats(index).PixelIdxList]) = 1;
figure
imshow(longestLine)
Instead of using max use sort in descending order and take the second element. Like max, sort also provides the indexes of the returned values, so the two functions are pretty compatible.
eStats = regionprops(imsk, {'Area','PixelIdxList'});
[length, index] = sort([lineStats.Area], 'descend');
longestLine = zeros(size(imsk));
longestLine(lineStats(index(2)).PixelIdxList)=1; % here take the second largest
figure
imshow(longestLine)
As an alternative with focus on performance and ease of use, here's one approach using bwlabel instead of regionprops -
[L, num] = bwlabel(imsk, 8);
count_pixels_per_obj = sum(bsxfun(#eq,L(:),1:num));
[~,sidx] = sort(count_pixels_per_obj,'descend');
N = 3; % Shows N biggest objects/lines
figure,imshow(ismember(L,sidx(1:N))),title([num2str(N) ' biggest blobs'])
On the performance aspect, here's one post that does some benchmarking on snowflakes and coins images from MATLAB's image gallery.
Sample run -
imsk = im2bw(imread('coins.png')); %%// Coins photo from MATLAB Library
N = 2:
N = 3:
I want to create a MATLAB function to import data from files in another directory and fit them to a given model, but because the data need to be filtered (there's "thrash" data in different places in the files, eg. measurements of nothing before the analyzed motion starts).
So the vectors that contain the data used to fit end up having different lengths and so I can't return them in a matrix (eg. x in my function below). How can I solve this?
I have a lot of datafiles so I don't want to use a "manual" method. My function is below. All and suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
datafiles = dir(pattern);
path = fileparts(pattern);
p = NaN(length(datafiles));
y_func = [];
for i = 1:length(datafiles)
exist(strcat(path, '/', datafiles(i).name));
filename = datafiles(i).name;
data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
filedata = data.data/1e3;
xdata = filedata(:,xcol);
ydata = filedata(:,ycol);
filter = filedata(:,xcol) > xfilter(i);
x(i,:) = xdata(filter);
y(i,:) = ydata(filter);
y_c(i,:) = calib(y(i,:), p_calib);
error = #(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
p(i,:) = fminsearch(error, p_0);
y_func = [y_func; func(x(i,:), p(i,:))];
end
end
sample data: http://hastebin.com/mokocixeda.md
There are two strategies I can think of:
I would return the data in a vector of cells instead, where the individual cells store vectors of different lengths. You can access data the same way as arrays, but use curly braces: Say c{1}=[1 2 3], c{2}=[1 2 10 8 5] c{3} = [ ].
You can also filter the trash data upon reading a line, if that makes your vectors have the same length.
If memory is not an major issue, try filling up the vectors with distinct values, such as NaN or Inf - anything, that is not found in your measurements based on their physical context. You might need to identify the longest data-set before you allocate memory for your matrices (*). This way, you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea ... allocate memory based on the size of the largest file first. Fill it up with e.g. NaN's
matrix = zeros(length(datafiles), longest_file_line_number) .* NaN;
Then run your function. Determine the length of the longest consecutive set of data.
new_max = length(xdata(filter));
if new_max > old_max
old_max = new_max;
end
matrix(i, length(xdata(filter))) = xdata(filter);
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);
Okay so this sounds easy but no matter how many times I have tried I still cannot get it to plot correctly. I need only 3 lines on the same graph however still have an issue with it.
iO = 2.0e-6;
k = 1.38e-23;
q = 1.602e-19;
for temp_f = [75 100 125]
T = ((5/9)*temp_f-32)+273.15;
vd = -1.0:0.01:0.6;
Id = iO*(exp((q*vd)/(k*T))-1);
plot(vd,Id,'r',vd,Id,'y',vd,Id,'g');
legend('amps at 75 F', 'amps at 100 F','amps at 125 F');
end;
ylabel('Amps');
xlabel('Volts');
title('Current through diode');
Now I know the plot function that is currently in their isn't working and that some kind of variable needs setup like (vd,Id1,'r',vd,Id2,'y',vd,Id3,'g'); however I really can't grasp the concept of changing it and am seeking help.
You can use the "hold on" function to make it so each plot command plots on the same window as the last.
It would be better to skip the for loop and just do this all in one step though.
iO = 2.0e-6;
k = 1.38e-23;
q = 1.602e-19;
temp_f = [75 100 125];
T = ((5/9)*temp_f-32)+273.15;
vd = -1.0:0.01:0.6;
% Convert this 1xlength(vd) vector to a 3xlength(vd) vector by copying it down two rows.
vd = repmat(vd,3,1);
% Convert this 1x3 array to a 3x1 array.
T=T';
% and then copy it accross to length(vd) so each row is all the same value from the original T
T=repmat(T,1,length(vd));
%Now we can calculate Id all at once.
Id = iO*(exp((q*vd)./(k*T))-1);
%Then plot each row of the Id matrix as a seperate line. Id(1,:) means 1st row, all columns.
plot(vd,Id(1,:),'r',vd,Id(2,:),'y',vd,Id(3,:),'g');
ylabel('Amps');
xlabel('Volts');
title('Current through diode');
And that should get what you want.