I want to create a MATLAB function that imports data from files in another directory and fits them to a given model. The data need to be filtered first, because there is "trash" data in different places in the files (e.g. measurements of nothing before the analyzed motion starts).
As a result, the vectors that contain the data used for the fit end up having different lengths, so I can't return them in a matrix (e.g. x in my function below). How can I solve this?
I have a lot of data files, so I don't want to use a "manual" method. My function is below. Any and all suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
datafiles = dir(pattern);
path = fileparts(pattern);
p = NaN(length(datafiles));
y_func = [];
for i = 1:length(datafiles)
exist(strcat(path, '/', datafiles(i).name));
filename = datafiles(i).name;
data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
filedata = data.data/1e3;
xdata = filedata(:,xcol);
ydata = filedata(:,ycol);
filter = filedata(:,xcol) > xfilter(i);
x(i,:) = xdata(filter);
y(i,:) = ydata(filter);
y_c(i,:) = calib(y(i,:), p_calib);
error = @(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
p(i,:) = fminsearch(error, p_0);
y_func = [y_func; func(x(i,:), p(i,:))];
end
end
sample data: http://hastebin.com/mokocixeda.md
There are two strategies I can think of:
First, I would return the data in a cell array instead, where the individual cells store vectors of different lengths. You access the contents much like an array, but with curly braces: say c{1} = [1 2 3], c{2} = [1 2 10 8 5], c{3} = [].
You can also filter the trash data as you read each line, if that makes your vectors have the same length.
Second, if memory is not a major issue, try padding the vectors with a distinct value such as NaN or Inf: anything that cannot occur in your measurements given their physical context. You might need to identify the longest data set before you allocate memory for your matrices (*). This way, you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea: allocate memory based on the size of the largest file first, and fill it with e.g. NaNs.
matrix = NaN(length(datafiles), longest_file_line_number);
old_max = 0;  % longest filtered length seen so far
Then run your function. Determine the length of the longest consecutive set of data.
new_max = length(xdata(filter));
if new_max > old_max
old_max = new_max;
end
matrix(i, 1:length(xdata(filter))) = xdata(filter);
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);
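The cell-array strategy could look roughly like the sketch below. The raw vectors and the threshold are made-up stand-ins for the imported files and xfilter; only the storage pattern is the point. The last loop shows how the same cells could be converted to a NaN-padded matrix afterwards, if a matrix is still wanted.

```matlab
% Hypothetical stand-ins for the x columns of three imported files
raw = {[0 0 1 2 3], [0 4 5], [6 7 8 9]};
threshold = 0.5;                       % assumed filter value (plays the role of xfilter)

x = cell(1, numel(raw));               % one cell per file
for i = 1:numel(raw)
    v = raw{i};
    x{i} = v(v > threshold);           % filtered vectors may differ in length
end

lens = cellfun(@numel, x);             % length of each filtered vector

% Optional: build a NaN-padded matrix from the cells
padded = NaN(numel(x), max(lens));
for i = 1:numel(x)
    padded(i, 1:lens(i)) = x{i};
end
```

Inside the fitting loop, x{i} can then be passed to func and fminsearch exactly where x(i,:) was used before.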
Setup
I have an array of captured data. The data may be captured on just 1 device or up to a dozen devices, with each device being a column in the array. I have a prior statement which I execute on the array to then turn it into a logical array to find particular points of interest in the data. Due to the nature of the data, there are many 0's and only a few 1's. I need to return an array with the indices of the 1's so I can go back and capture the data between those points (see update below).
find is the obvious choice of function. However, the result I need must have one column for each device, whereas find normally returns linear indices regardless of the dimensions of the array.
The devices follow a pattern - but aren't exactly the same. So, complicating this is the fact that the number of 1's in each column is close to, but not guaranteed to be exactly the same depending on the exact timing the data capture is stopped (they are most often different from each other by 1 element, but could be different by more).
MATLAB CODE ATTEMPTS
Because of that difference, I can't use the following simple code:
for p = 1:np
indices( :, p ) = find( devices.data.cross( :, p ) );
end
Notes:
np is the number of columns in the data = number of devices captured.
devices is a class representing the collection of devices
data is a TimeTable containing captured data on all the devices
cross is a column in the data TimeTable which contains the logical array
Even this simple code is inefficient and generates the Code Analyzer warning:
The variable 'indices' appears to change size on every loop
iteration (within a script). Consider preallocating for speed.
As expected, it doesn't work as I get an error similar to the following:
Unable to perform assignment because the size of the left side is
448-by-1 and the size of the right side is 449-by-1.
I know why I get this error - each column in an array in MATLAB has to have the same number of rows, so I can't make the assignment if the row size doesn't match. I need to pad the "short" columns somehow. In this case, repeating the last index will work for my later operations without causing an error.
I can't figure out a "good" way to do this. I can't pre-populate the array rows because I don't know how many rows there will be until I've done the find operation.
I can change the code as follows:
indices = [];
for p = 1:np
tempindices = find( devices.data.cross(:, p) );
sizediff = size( tempindices, 1 ) - size( indices, 1 );
if p > 1
if sizediff > 0
padding = repmat(indices(end, 1:(p - 1)), sizediff, 1);
indices = [indices; padding];
elseif sizediff < 0
padding = repmat(tempindices(end), abs(sizediff), 1);
tempindices = [tempindices; padding];
end
end
indices(:,p) = tempindices;
end
Note: padarray would have been useful here, but I don't have the Image Processing Toolbox so I cannot use it.
This code works, but it is very inefficient: it creates multiple otherwise unneeded variables in the workspace and generates multiple "appears to change size on every loop iteration" warnings in Code Analyzer. Is there a more efficient way to do this?
Update / Additional Information:
Some more context is needed for my issue. Given that devices.data.cross is a logical array, to just "pick" the data I want from other columns in my table (as I originally described my problem) I could select a column from devices.data.cross and pass that logical column as a subscript to get that data. I do that where it works. However, for some of the columns I need to select "chunks" of the data between the indices and that's where (I think) I need the indices. Or, at least I don't know of another way to do it.
Here is example code of how I use the indices:
for p = 1:np
for i = 2:num_indices
these_indices = indices(i-1, p):( indices(i, p) - 1 );
rmsvoltage = sqrt( mean( devices.data.voltage(these_indices).^2 ) );
end
end
This is just one routine I do on the "chunks" of data. I also have a couple of functions where these chunks of data are passed for processing.
If I understood your problem correctly, the code below should work. I'm using the approach that Cris Luengo suggested in a comment under your question.
The key element is [rowIdcs, colIdcs] = find( cros ); which gives you the subscripts of the positions in cros that have a value of one. Please find further comments inline.
% Create some data for testing
volt = randn(10,10);
cros = randi(10,10,10) > 9;
% Get rowIdcs and colIdcs, which have both a size of Nx1,
% with N denoting the number of ones in the mask.
% rowIdcs and colIdcs are the subscripts of the ones in the mask.
[rowIdcs, colIdcs] = find( cros );
% The number of chunks is equal to number N of ones found in the mask;
numChunks = numel( rowIdcs );
% Initialize a vector for the rms
rms = zeros( numChunks, 1 );
% Loop over the chunks
for k = 1 : numChunks
curRow = rowIdcs(k);
curCol = colIdcs(k);
% Get indices of range over neighbouring rows
chunkRowIdcs = curRow + [-1 0 1]; %i.e. these_indices in your example
% Remove indices that are out of range
chunkRowIdcs( chunkRowIdcs < 1 | chunkRowIdcs > size(volt,1) ) = [];
% Get voltages covered by chunk
chunkVoltages = volt( chunkRowIdcs, curCol );
% Get RMS over voltages
rms(k) = sqrt( mean( chunkVoltages(:).^2 ));
end
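If the padded indices matrix from the question is still needed, the two-output form of find can also feed it directly. The following is only a sketch on a tiny made-up mask; it assumes every column contains at least one true value, and it pads short columns by repeating their last index, as the question describes.

```matlab
% Small made-up mask: column 1 has three ones, column 2 has two
cross = logical([1 0; 0 1; 1 1; 0 0; 1 0]);
np = size(cross, 2);

% Row and column subscripts of every true element
[rowIdx, colIdx] = find(cross);

% Group the row subscripts by column: one cell per device
idxPerCol = accumarray(colIdx, rowIdx, [np 1], @(v) {sort(v)});

% Preallocate once, now that the longest column is known
maxLen = max(cellfun(@numel, idxPerCol));
indices = zeros(maxLen, np);
for p = 1:np
    v = idxPerCol{p};                  % assumed non-empty
    indices(:, p) = [v; repmat(v(end), maxLen - numel(v), 1)];
end
```

Because indices is allocated once at its final size, this version should not trigger the "appears to change size" warning.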
I am working with lung data sets in matlab, but I need to sort the slices correctly and show them.
I know this can be done using the "instance number" parameter in the DICOM header, but I did not manage to get the code working.
How can I do that?
Here is my piece of code:
Dicom_directory = uigetdir();
sdir = strcat(Dicom_directory,'\*.dcm');
files = dir(sdir);
I = strcat(Dicom_directory, '\',files(i).name);
x = repmat(double(0), [512 512 1 ]);
x(:,:,1) = double(dicomread(I));
axes(handles.axes1);
imshow(x,[]);
First of all, to get the DICOM header, you need to use dicominfo, which returns a struct containing each of the fields. If you want to sort by the InstanceNumber field, you can do it as follows.
%// Get all of the files
directory = uigetdir();
files = dir(fullfile(directory, '*.dcm'));
filenames = cellfun(@(x)fullfile(directory, x), {files.name}, 'uni', 0);
%// Ensure that they are actually DICOM files and remove the ones that aren't
notdicom = ~cellfun(@isdicom, filenames);
files(notdicom) = [];
filenames(notdicom) = [];
%// Now load all the DICOM headers into an array of structs
infos = cellfun(@dicominfo, filenames);
%// Now sort these by the instance number
[~, inds] = sort([infos.InstanceNumber]);
infos = infos(inds);
%// Now you can loop through and display them
dcm = dicomread(infos(1));
him = imshow(dcm, []);
for k = 1:numel(infos)
set(him, 'CData', dicomread(infos(k)));
pause(0.1)
end
That being said, you have to be careful sorting DICOMs using the InstanceNumber. This is not a robust way of doing it because the "InstanceNumber" can refer to the same image acquired over time or different slices throughout a 3D volume. If you want one or the other, I would choose something more specific.
If you want to sort physical slices, I would recommend sorting by the SliceLocation field (if available). If sorting by time, you could use TriggerTime (if available).
Also you will need to consider that there could also potentially be multiple series in your folder so maybe consider using the SeriesNumber to differentiate these.
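As a rough sketch of that advice, the snippet below first restricts the headers to one series and then sorts by SliceLocation. The struct array is synthetic so that the snippet runs without any DICOM files; in practice infos would come from dicominfo as above, the wanted series number is an assumption, and both fields exist only if the scanner wrote them.

```matlab
% Synthetic stand-in for the struct array returned by dicominfo
infos = struct('SeriesNumber',  {2, 1, 1, 1}, ...
               'SliceLocation', {0, 30.5, -10, 12.25});

wantedSeries = 1;                          % assumed series of interest

% Keep only headers that belong to the wanted series
inSeries = [infos.SeriesNumber] == wantedSeries;
seriesInfos = infos(inSeries);

% Sort the remaining headers by physical slice position
[~, order] = sort([seriesInfos.SliceLocation]);
seriesInfos = seriesInfos(order);

sortedLocations = [seriesInfos.SliceLocation];
```

With real headers, isfield(infos, 'SliceLocation') is a cheap check to run before relying on the field.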
What is the best way to do random sampling with replacement from a dataset? My dataset is 316 x 34. I want to segment the data into three buckets, but with replacement. Should I use randperm? I need to keep the row indices intact, because they are handy for identifying the label data. I am new to MATLAB. I saw a couple of random sampling methods, but they didn't look like they do what I am looking for. It's strange to think that something like this doesn't exist in MATLAB, so I wrote the code below.
My issue is that when I do row_idx = round(rand(1)*316), I sometimes get zero. That leads to two questions:
What should I do to avoid zero?
What is the best way to do the random sample with replacement.
shuffle_X = X(randperm(size(X,1)),:);
lengthOf_shuffle_X = length(shuffle_X)
number_of_rows_per_bucket = round(lengthOf_shuffle_X / 3)
bucket_cell = cell(3,1)
bag_matrix = []
for k = 1:length(bucket_cell)
for i = 1:number_of_rows_per_bucket
row_idx = round(rand(1)*316)
bag_matrix(i,:) = shuffle_X(row_idx,:)
end
bucket_cell{k} = bag_matrix
end
I could do the following:
if row_idx == 0
row_idx = round(rand(1)*316)
end
assuming the random number generator will never give zero in two consecutive draws.
randi is a good way to get integer indices for sampling with replacement. Assuming you want to fill three buckets with an equal number of samples, then you can write
data = rand(316,34); %# create some dummy data
number_of_data = size(data,1);
number_of_rows_per_bucket = 50;
bucket_cell = cell(1,3);
idx = randi([1,number_of_data],[number_of_rows_per_bucket,3]);
for iBucket = 1:3
bucket_cell{iBucket} = data(idx(:,iBucket),:);
end
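On the zero index specifically: round(rand*316) returns 0 whenever rand*316 is below 0.5, so redrawing once does not rule it out. A minimal sketch of two ways to get indices that are always in 1..n is shown below; the sample count of 1000 is arbitrary.

```matlab
n = 316;                               % number of rows in the dataset
numDraws = 1000;                       % arbitrary number of draws

% Option 1: randi draws integers uniformly from 1..n
idxRandi = randi(n, numDraws, 1);

% Option 2: rand lies strictly between 0 and 1, so ceil maps it to 1..n
idxCeil = ceil(rand(numDraws, 1) * n);

% Both index vectors can be used directly as row subscripts
data = rand(n, 34);                    % dummy data, as in the answer above
sample = data(idxRandi, :);
```

Unlike round, both options also give the first and last rows the same probability as every other row.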
To answer the question: randperm gives you a draw order without replacement, since each item can be drawn only once.
randi draws with replacement, that is, the same item may be drawn many times.
If you want to "segment" a dataset, that usually means splitting it into three distinct sets. For that, you draw without replacement (you don't put the items back; use randperm). If you did it with replacement (using randi) and kept drawing until every item had been picked, it would be very slow, since after a while the chance of drawing an item you have not seen before becomes very low.
(See the coupon collector's problem for details.)
If you need a segmentation that is a split, you can just go over the elements and independently decide which bucket each one goes into. (That is, you choose a bucket for each item with replacement: any chosen bucket goes back into the game.)
For that:
% if your data items are vectors say data = [1 1; 2 2; 3 3; 4 4]
num_data = size(data, 1); % number of rows; length() would return the larger dimension
bucket_labels = randi(3,[1,num_data]); % draw a bucket label for each item, independently.
for i=1:3
bucket{i} = data(bucket_labels==i,:);
end
%if your data items are scalars say data = [1 2 3 4 5]
num_data = length(data);
bucket_labels = randi(3,[1,num_data]);
for i=1:3
bucket{i} = data(bucket_labels==i);
end
there we go.
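For comparison, a split without replacement into three nearly equal buckets could be sketched with randperm as follows. The dummy data and the way the bucket boundaries are rounded are assumptions made for the example.

```matlab
% Dummy data: 10 items, one per row
data = (1:10)' * [1 1];
n = size(data, 1);

% Random order in which every row appears exactly once
perm = randperm(n);

% Boundaries that cut the permutation into three nearly equal parts
edges = round(linspace(0, n, 4));

bucket = cell(1, 3);
for b = 1:3
    rows = perm(edges(b) + 1 : edges(b + 1));
    bucket{b} = data(rows, :);
end
```

Every row lands in exactly one bucket, and keeping perm around preserves the mapping back to the original row indices (and therefore to the labels).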
I am trying to deal with a very large dataset. I have k = ~4200 matrices (varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. can find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. These are ~100 X ~100 matrices, on average. Using single precision floats, it works out to 400 GB overall.
I need to 1) generate the cell array or pieces of it without trying to place the whole thing in memory and 2) access its elements (and their elements) in like fashion. My attempts have been inefficient due to reliance on MATLAB's eval() as well as save and clear occurring in loops.
for i=1:k
[~,m] = size(data{i});
cur_var = ['H' int2str(i)];
%# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
eval([cur_var ' = cell(1,k-i);']);
for j=i+1:k
[~,n] = size(data{j});
eval([cur_var '{i,j} = zeros(m,n,''single'');']);
eval([cur_var '{i,j} = compare(data{i},data{j});']);
end
save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
clear(cur_var);
end
The other thing I have done is to perform the split when mod((i+j-1)/2,max(factor(k(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad because a linear index could be used.
Does anyone know/see a better way?
Here's a version that combines going fast with using minimal memory.
I use fwrite/fread so that you still can use parfor (and this time, I made sure it works :) )
%# assume data is loaded and k is known
%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));
%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');
%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)
%# make the comparison - do double b/c there shouldn't be a memory issue
currentComparison = compare(data{iIdx(ct)},data{jIdx(ct)});
%# create save-name as H_i_j, e.g. H_104_23
saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));
%# save. Since 'save' is not allowed, use fwrite to write the data to disk
fid = fopen(saveName,'w');
%# for simplicity: save data as vector, add two elements to the beginning
%# to store the size of the array
fwrite(fid,[size(currentComparison)';currentComparison(:)],'double'); %# explicit precision: fwrite defaults to uint8
%# close file
fclose(fid);
end
%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
tmp = fread(fid,inf,'double'); %# read with the same precision that was written
fclose(fid);
%# reshape into 2D array.
data = reshape(tmp(3:end),tmp(1),tmp(2));
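One detail worth checking with this approach is the precision argument: when none is given, fwrite writes uint8 and fread reads uint8, so non-integer values do not survive the round trip. The sketch below is a self-contained round trip under the assumption that the results are stored as single, as in the question; the temporary file name and the test matrix are placeholders.

```matlab
% Placeholder comparison result, stored as single precision
A = single(rand(3, 4));
fname = [tempname() '.bin'];           % temporary file for the demo

% Write: two doubles for the size, then the values as single
fid = fopen(fname, 'w');
fwrite(fid, size(A), 'double');
fwrite(fid, A(:), 'single');
fclose(fid);

% Read back with matching precisions
fid = fopen(fname, 'r');
sz = fread(fid, 2, 'double')';         % row vector [rows cols]
B = fread(fid, prod(sz), 'single=>single');
fclose(fid);
B = reshape(B, sz);

delete(fname);                         % clean up the demo file
```

Writing the values as single halves the file size relative to double, which matters at the data volume described in the question.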
You can get rid of the eval and clear calls by assigning the filename separately.
for i=1:k
[~,m] = size(data{i});
file_name = ['H' int2str(i)];
cur_var = cell(1, k-i);
for j=i+1:k
[~,n] = size(data{j});
cur_var{i,j} = zeros(m, n, 'single');
cur_var{i,j} = compare(data{i}, data{j});
end
save(file_name, 'cur_var');
end
If you need the saved variables to take different names, use the -struct option to save.
str.(file_name) = cur_var;
save(file_name, '-struct', 'str');
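To make the -struct idea concrete, here is a small round trip. The variable contents and the use of tempdir are assumptions for the demo; the point is that the field name of the struct becomes the variable name inside the MAT-file.

```matlab
file_name = 'H3';                      % desired variable name and file name
cur_var = {magic(3), eye(2)};          % placeholder comparison results

% A dynamic field name turns the string into a struct field
str = struct();
str.(file_name) = cur_var;

% Each field of str is saved as a separate variable
matFile = fullfile(tempdir, [file_name '.mat']);
save(matFile, '-struct', 'str');

% Loading into a struct shows the variable is stored under the name H3
loaded = load(matFile);
delete(matFile);                       % clean up the demo file
```

Note that save expects the name of the struct as a character vector ('str'), not the struct itself.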
I have a data file m.txt that looks something like this (with a lot more points):
286.842995
3.444398
3.707202
338.227797
3.597597
283.740414
3.514729
3.512116
3.744235
3.365461
3.384880
Some of the values (like 338.227797) are very different from the values I generally expect (smaller numbers).
So I am thinking of removing all the points that lie outside the 3-sigma range. How can I do that in MATLAB?
Also, the bigger problem is that this file has a separate file t.txt associated with it which stores the corresponding time values for these numbers. So, I'll have to remove the corresponding time values from the t.txt file also.
I am still learning MATLAB, and I know there would be some good way of doing this (better than storing indices of the elements that were removed from m.txt and then removing those elements from the t.txt file)
@Amro is close, but the FIND is unnecessary (look up logical subscripting) and you need to include the mean for a true +/-3 sigma range. I would go with the following:
%# load files
m = load('m.txt');
t = load('t.txt');
%# find values within range
z = 3;
meanM = mean(m);
sigmaM = std(m);
I = abs(m - meanM) <= z * sigmaM;
%# keep values within range
m = m(I);
t = t(I);
%# load files
m = load('m.txt');
t = load('t.txt');
%# find outliers indices
z = 3;
idx = find( abs(m-mean(m)) > z*std(m) );
%# remove them from both data and time values
m(idx) = [];
t(idx) = [];
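One caveat when the outliers are as large as in the sample above: they inflate both the mean and the standard deviation, so a mean/std test may not flag them. The sketch below runs the 3-sigma test on the eleven values listed in the question and compares it with a median/MAD test. The median/MAD test is a swapped-in technique that neither answer uses, the factor 1.4826 is the usual scaling for normally distributed data, and the time stamps are made up.

```matlab
% The eleven sample values from the question
m = [286.842995; 3.444398; 3.707202; 338.227797; 3.597597; 283.740414; ...
     3.514729; 3.512116; 3.744235; 3.365461; 3.384880];
t = (1:numel(m))';                     % hypothetical time stamps

z = 3;

% Classic test: within z standard deviations of the mean
keepSigma = abs(m - mean(m)) <= z * std(m);

% Robust variant: within z scaled MADs of the median
med = median(m);
scaledMad = 1.4826 * median(abs(m - med));
keepRobust = abs(m - med) <= z * scaledMad;

% The same logical mask removes the matching time values
mClean = m(keepRobust);
tClean = t(keepRobust);
```

On these eleven values the mean/std test keeps every point, while the median/MAD test keeps the eight values near 3.5 and drops the three large ones. On the full file, where large values are presumably rarer, the mean/std test may behave better than it does here.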