Group variables based on lengths of specific arrays - matlab

I have a long list of variables in a dataset which contains multiple time channels with different sampling rates, such as time_1, time_2, TIME, Time, etc. There are also multiple other variables that are dependent on either of these times.
I'd like to list all possible channels that contain 'time' (case-insensitive partial string search within Workspace) and search & match which variable belongs to each item of this time list, based on the size of the variables and then group them in a structure with the values of the variables for later analysis.
For example:
Name Size Bytes Class
ENGSPD_1 181289x1 1450312 double
Eng_Spd 12500x1 100000 double
Speed 41273x1 330184 double
TIME 41273x1 330184 double
Time 12500x1 100000 double
engine_speed_2 1406x1 11248 double
time_1 181289x1 1450312 double
time_2 1406x1 11248 double
In this case, I have 4 time channels with different names & sizes and 4 speed channels which belong to each of these time channels.
whos function is case-sensitive and it will only return the name of the variable, rather than the values of the variable.

As a preamble I'm going to echo my comment from above and earlier comments from folks here and on your other similar questions:
Please stop trying to manipulate your data this way.
It may have made sense at the beginning but, given the questions you've asked on SO to date, this isn't the first time you've encountered issues trying to pull everything together and if you continue this way it's not going to be the last. This approach is highly error prone, unreliable, and unpredictable. Every step of the process requires you to make assumptions about your data that cannot be guaranteed (size of data matching, variables being present and named predictably, etc.). Rather than trying to come up with creative ways to hack together the data, start over and output your data predictably from the beginning. It may take some time but I guarantee it's going to save time in the future and it will make sense to whoever looks at this in 6 months trying to figure out what is going on.
For example, there is absolutely no significant effort needed to output your variables as:
outputstructure.EngineID.time = sometimeseries;
outputstructure.EngineID.speed = somedata;
Where EngineID can be any valid variable name. This is simple and it links your data together permanently and robustly.
That being said, the following will bring a marginal amount of sanity to your data set:
% Build up a totally amorphous data set
ENGSPD_1 = rand(10, 1);
Eng_Spd = rand(20, 1);
Speed = rand(30, 1);
TIME = rand(30, 1);
Time = rand(20, 1);
engine_speed_2 = rand(5, 1);
time_1 = rand(10, 1);
time_2 = rand(5, 1);
% Identify time and speed variable using regular expressions
% Assumes time variables contain 'time' (case insensitive)
% Assumes speed variables contain 'spd', 'sped', or 'speed' (case insensitive)
timevars = whos('-regexp', '[T|t][I|i][M|m][E|e]');
speedvars = whos('-regexp', '[S|s][P|p][E|e]{0,2}[D|d]');
% Pair timeseries and data arrays together. Data is only coupled if
% the number of rows in the timeseries is exactly the same as the
% number of rows in the data array.
timesizes = vertcat(speedvars(:).size); % Concatenate timeseries sizes
speedsizes = vertcat(timevars(:).size); % Concatenate speed array sizes
% Find intersection and their locations in the structures returned by whos
% By using intersect we only get the data that is matched
[sizes, timeidx, speedidx] = intersect(timesizes(:,1), speedsizes(:,1));
% Preallocate structure
ndata = length(sizes);
groupeddata(ndata).time = [];
groupeddata(ndata).speed = [];
% Unavoidable (without saving/loading data) eval loop :|
for ii = 1:ndata
groupeddata(ii).time = eval('timevars(timeidx(ii)).name');
groupeddata(ii).speed = eval('speedvars(speedidx(ii)).name');
end
A non-eval method, by request:
ENGSPD_1 = rand(10, 1);
Eng_Spd = rand(20, 1);
Speed = rand(30, 1);
TIME = rand(30, 1);
Time = rand(20, 1);
engine_speed_2 = rand(5, 1);
time_1 = rand(10, 1);
time_2 = rand(5, 1);
save('tmp.mat')
oldworkspace = load('tmp.mat');
varnames = fieldnames(oldworkspace);
timevars = regexpi(varnames, '.*time.*', 'match', 'once');
timevars(cellfun('isempty', timevars)) = [];
speedvars = regexpi(varnames, '.*spe{0,2}d.*', 'match', 'once');
speedvars(cellfun('isempty', speedvars)) = [];
timesizes = zeros(length(timevars), 2);
for ii = 1:length(timevars)
timesizes(ii, :) = size(oldworkspace.(timevars{ii}));
end
speedsizes = zeros(length(speedvars), 2);
for ii = 1:length(speedvars)
speedsizes(ii, :) = size(oldworkspace.(speedvars{ii}));
end
[sizes, timeidx, speedidx] = intersect(timesizes(:,1), speedsizes(:,1));
ndata = length(sizes);
groupeddata(ndata).time = [];
groupeddata(ndata).speed = [];
for ii = 1:ndata
groupeddata(ii).time = oldworkspace.(timevars{timeidx(ii)});
groupeddata(ii).speed = oldworkspace.(speedvars{speedidx(ii)});
end
See this gist for timing.

Related

Fast way to get mean values of rows accordingly to subscripts

I have a data, which may be simulated in the following way:
N = 10^6;%10^8;
K = 10^4;%10^6;
subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;
In other words, it is a matrix, where the last row is subscripts.
Now I want to calculate nanmean() for each subscript. Also I want to save number of rows for each subscript. I have a 'dummy' code for this:
uniqueSubs = unique(M(:,6));
avM = nan(numel(uniqueSubs),6);
for iSub = 1:numel(uniqueSubs)
tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1)];
end
The problem is, that it is too slow. I want it to work for N = 10^8 and K = 10^6 (see commented part in the definition of these variables.
How can I find the mean of the data in a faster way?
This sounds like a perfect job for findgroups and splitapply.
% Find groups in the final column
G = findgroups(M(:,6));
% function to apply per group
fcn = #(group) [mean(group, 1, 'omitnan'), size(group, 1)];
% Use splitapply to apply fcn to each group in M(:,1:5)
result = splitapply(fcn, M(:, 1:5), G);
% Check
assert(isequaln(result, avM));
M = sortrows(M,6); % sort the data per subscript
IDX = diff(M(:,6)); % find where the subscript changes
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)]; % add start and end of data
for iSub= 2:numel(tmp)
% Calculate the mean over just a single subscript, store in iSub-1
avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];tmp(iSub-1)];
end
This is some 60 times faster than your original code on my computer. The speed-up mainly comes from presorting the data and then finding all locations where the subscript changes. That way you do not have to traverse the full array each time to find the correct subscripts, but rather you only check what's necessary each iteration. You thus calculate the mean over ~100 rows, instead of first having to check in 1,000,000 rows whether each row is needed that iteration or not.
Thus: in the original you check numel(uniqueSubs), 10,000 in this case, whether all N, 1,000,000 here, numbers belong to a certain category, which results in 10^12 checks. The proposed code sorts the rows (sorting is NlogN, thus 6,000,000 here), and then loop once over the full array without additional checks.
For completion, here is the original code, along with my version, and it shows the two are the same:
N = 10^6;%10^8;
K = 10^4;%10^6;
subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;
uniqueSubs = unique(M(:,6));
%% zlon's original code
avM = nan(numel(uniqueSubs),7); % add the subscript for comparison later
tic
uniqueSubs = unique(M(:,6));
for iSub = 1:numel(uniqueSubs)
tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1) uniqueSubs(iSub)];
end
toc
%%%%% End of zlon's code
avM = sortrows(avM,7); % Sort for comparison
%% Start of Adriaan's code
avM2 = nan(numel(uniqueSubs),6);
tic
M = sortrows(M,6);
IDX = diff(M(:,6));
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)];
for iSub = 2:numel(tmp)
avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
end
toc %tic/toc should not be used for accurate timing, this is just for order of magnitude
%%%% End of Adriaan's code
all(avM(:,1:6) == avM2) % Do the comparison
% End of script
% Output
Elapsed time is 58.561347 seconds.
Elapsed time is 0.843124 seconds. % ~70 times faster
ans =
1×6 logical array
1 1 1 1 1 1 % i.e. the matrices are equal to one another

Matlab mystery: Comparison is not defined between double and datetime arrays

I'm new to Matlab, troubleshooting a comparison error by chopping it down as simple as can be. This test matrix of two rows works fine, then I copy/paste to add a third record and I get a conversion error (for the same data)!
Background: my data is for patient exams: patient, arrival time, exam start time, and I'm writing a program to determine how full the waiting room gets. If the patient isn't seen instantaneously I add them to a patient_queue (just saving their eventual start time as a placeholder). The data is in order of arrival time, so at each new row I check the queue to see if anyone's exam started in the meantime, remove them, and go from there.
Here's an example with only two patient rows (this works):
data_matrix = [1,735724.291666667,735724.322916667,735724.343750000,5;
2,735724.331250000,735724.340277778,735724.371527778,18];
patient_queue = [];
highest_wait_num = 0;
[rows, columns] = size(data_matrix);
for i = 1:rows
this_row_arrival = datetime (data_matrix(i, 2), 'ConvertFrom', 'datenum');
this_row_exam_start = datetime (data_matrix(i, 3), 'ConvertFrom', 'datenum');
now = this_row_arrival; %making a copy called 'now' for readability below
% check patient queue: if anyone left in the meantime, remove them
for j = 1:length(patient_queue)
if patient_queue{j} < now
patient_queue(j) = [];
end
end
% if new patient isn't seen immediately (tb > ta), add tb to the queue
if this_row_exam_start > this_row_arrival
patient_queue{end+1} = this_row_exam_start;
end
% get the current queue size
patient_queue_non_zero = (~cellfun('isempty',patient_queue));
indices = find(patient_queue_non_zero);
current_queue_count = length(indices);
% if the current queue size beats the highest we've seen, update it
if current_queue_count > highest_wait_num
highest_wait_num = current_queue_count;
end
end
patient_queue{j}
highest_wait_num
But when I use the full data set I get an error at the line:
if patient_queue{j} < now
Comparison is not defined between double and datetime arrays.
So I'm narrowing the problem, and I can even reproduce the error by taking my simple matrix of 2 records that worked, copy the second one to make a matrix of 3, like so- just swapping that in the code above makes the error(!!):
data_matrix = [1,735724.291666667,735724.322916667,735724.343750000,5;
2,735724.331250000,735724.340277778,735724.371527778,18;
2,735724.331250000,735724.340277778,735724.371527778,18]
What am I missing?
Here is a much shorter way to do this, without converting numbers to dates:
patient_queue = [];
highest_wait_num = 0;
rows = size(data_matrix,1);
for k = 1:rows
this_row_arrival = data_matrix(k, 2);
this_row_exam_start = data_matrix(k, 3);
now = data_matrix(k, 2); %making a copy called 'now' for readability below
% check patient queue: if anyone left in the meantime, remove them
patient_queue(patient_queue < now) = [];
% if new patient isn't seen immediately (tb > ta), add tb to the queue
if this_row_exam_start > this_row_arrival
patient_queue(end+1) = this_row_exam_start;
end
% get the current queue size
current_queue_count = numel(nonzeros(patient_queue));
% if the current queue size beats the highest we've seen, update it
highest_wait_num = max(current_queue_count,highest_wait_num);
end
What difference the right braces make :/
I had declared this as a vector instead of a cell array:
patient_queue = [];
As a result, I couldn't get to the value inside.
The answer was to:
- declare it this way instead: patient_queue = {};
- add one more (missing) datetime conversion for patient_queue{j}
This is a specific pair of mistakes I had, so I'll delete the post since it's unlikely to be useful for others

Matlab loop through functions using an array in a for loop

I am writing a code to do some very simple descriptive statistics, but I found myself being very repetitive with my syntax.
I know there's a way to shorten this code and make it more elegant and time efficient with something like a for-loop, but I am not quite keen enough in coding (yet) to know how to do this...
I have three variables, or groups (All data, condition 1, and condition 2). I also have 8 matlab functions that I need to perform on each of the three groups (e.g mean, median). I am saving all of the data in a table where each column corresponds to one of the functions (e.g. mean) and each row is that function performed on the correspnding group (e.g. (1,1) is mean of 'all data', (2,1) is mean of 'cond 1', and (3,1) is mean of 'cond 2'). It is important to preserve this structure as I am outputting to a csv file that I can open in excel. The columns, again, are labeled according the function, and the rows are ordered by 1) all data 2) cond 1, and 3) cond 2.
The data I am working with is in the second column of these matrices, by the way.
So here is the tedious way I am accomplishing this:
x = cell(3,8);
x{1,1} = mean(alldata(:,2));
x{2,1} = mean(cond1data(:,2));
x{3,1} = mean(cond2data(:,2));
x{1,2} = median(alldata(:,2));
x{2,2} = median(cond1data(:,2));
x{3,2} = median(cond2data(:,2));
x{1,3} = std(alldata(:,2));
x{2,3} = std(cond1data(:,2));
x{3,3} = std(cond2data(:,2));
x{1,4} = var(alldata(:,2)); % variance
x{2,4} = var(cond1data(:,2));
x{3,4} = var(cond2data(:,2));
x{1,5} = range(alldata(:,2));
x{2,5} = range(cond1data(:,2));
x{3,5} = range(cond2data(:,2));
x{1,6} = iqr(alldata(:,2)); % inter quartile range
x{2,6} = iqr(cond1data(:,2));
x{3,6} = iqr(cond2data(:,2));
x{1,7} = skewness(alldata(:,2));
x{2,7} = skewness(cond1data(:,2));
x{3,7} = skewness(cond2data(:,2));
x{1,8} = kurtosis(alldata(:,2));
x{2,8} = kurtosis(cond1data(:,2));
x{3,8} = kurtosis(cond2data(:,2));
% write output to .csv file using cell to table conversion
T = cell2table(x, 'VariableNames',{'mean', 'median', 'stddev', 'variance', 'range', 'IQR', 'skewness', 'kurtosis'});
writetable(T,'descriptivestats.csv')
I know there is a way to loop through this stuff and get the same output in a much shorter code. I tried to write a for-loop but I am just confusing myself and not sure how to do this. I'll include it anyway so maybe you can get an idea of what I'm trying to do.
x = cell(3,8);
data = [alldata, cond2data, cond2data];
dfunction = ['mean', 'median', 'std', 'var', 'range', 'iqr', 'skewness', 'kurtosis'];
for i = 1:8,
for y = 1:3
x{y,i} = dfucntion(i)(data(1)(:,2));
x{y+1,i} = dfunction(i)(data(2)(:,2));
x{y+2,i} = dfunction(i)(data(3)(:,2));
end
end
T = cell2table(x, 'VariableNames',{'mean', 'median', 'stddev', 'variance', 'range', 'IQR', 'skewness', 'kurtosis'});
writetable(T,'descriptivestats.csv')
Any ideas on how to make this work??
You want to use a cell array of function handles. The easiest way to do that is to use the # operator, as in
dfunctions = {#mean, #median, #std, #var, #range, #iqr, #skewness, #kurtosis};
Also, you want to combine your three data variables into one variable, to make it easier to iterate over them. There are two choices I can see. If your data variables are all M-by-2 in dimension, you could concatenate them into a M-by-2-by-3 three-dimensional array. You could do that with
data = cat(3, alldata, cond1data, cond2data);
The indexing expression into data that retrieves the values you want would be data(:, 2, y). That said, I think this approach would have to copy a lot of data around and probably isn't the best for performance. The other way to combine data together is in 1-by-3 cell array, like this:
data = {alldata, cond1data, cond2data};
The indexing expression into data that retrieves the values you want in this case would be data{y}(:, 2).
Since you are looping from y == 1 to y == 3, you only need one line in your inner loop body, not three.
for y = 1:3
x{y, i} = dfunctions{i}(data{y}(:,2));
end
Finally, to get the cell array of strings containing function names to pass to cell2table, you can use cellfun to apply func2str to each element of dfunctions:
funcnames = cellfun(#func2str, dfunctions, 'UniformOutput', false);
The final version looks like this:
dfunctions = {#mean, #median, #std, #var, #range, #iqr, #skewness, #kurtosis};
data = {alldata, cond1data, cond2data};
x = cell(length(data), length(dfunctions));
for i = 1:length(dfunctions)
for y = 1:length(data)
x{y, i} = dfunctions{i}(data{y}(:,2));
end
end
funcnames = cellfun(#func2str, dfunctions, 'UniformOutput', false);
T = cell2table(x, 'VariableNames', funcnames);
writetable(T,'descriptivestats.csv');
You can create a cell array of functions using str2func :
function_string = {'mean', 'median', 'std', 'var', 'range', 'iqr', 'skewness', 'kurtosis'};
dfunction = {};
for ii = 1:length(function_string)
fun{ii} = str2func(function_string{ii})
end
Then you can use it on your data as you'd like to :
for ii = 1:8,
for y = 1:3
x{y,i} = dfucntion{ii}(data(1)(:,2));
x{y+1,i} = dfunction{ii}(data(2)(:,2));
x{y+2,i} = dfunction{ii}(data(3)(:,2));
end
end

MATLAB: Creating a matrix from for loop values?

I have the following code:
for i = 1450:9740:89910
n = i+495;
range = ['B',num2str(i),':','H',num2str(n)];
iter = xlsread('BrokenDisplacements.xlsx' , range);
displ = iter;
displ = [displ; iter];
end
Which takes values from an Excel file from a number of ranges I want and outputs them as matricies. However, this code just uses the final value of displ and creates the total matrix from there. I would like to total these outputs (displ) into one large matrix saving values along the way, how would I go about doing this?
Since you know the size of the block of data you are reading, you can make your code much more efficient as follows:
firstVals = 1450:9740:89910;
displ = zeros((firstVals(end) - firstVals(1) + 1 + 496), 7);
for ii = firstVals
n = ii + 495;
range = sprintf('B%d:H%d', ii, ii+495);
displ((ii:ii+495)-firstVals(1)+1,:) = xlsread('BrokenDiplacements.xlsx', range);
end
Couple of points:
I prefer not to use i as a variable since it is built in as sqrt(-1) - if you later execute code that assumes that to be true, you're in trouble
I am not assuming that the last value of ii is 89910 - by first assigning the value to a vector, then finding the last value in the vector, I sidestep that question
I assign all space in iter at once - otherwise, as it grows, Matlab keeps having to move the array around which can slow things down a lot
I used sprintf to generate the string representing the range - I think it's more readable but it's a question of style
I assign the return value of xlsread directly to a block in displ that is the right size
I hope this helps.
How about this:
displ=[];
for i = 1450:9740:89910
n = i+495;
range = ['B',num2str(i),':','H',num2str(n)];
iter = xlsread('BrokenDisplacements.xlsx' , range);
displ = [displ; iter];
end

MATLAB: Dividing a year-length varying-resolution time vector into months

I have a time series in the following format:
time data value
733408.33 x1
733409.21 x2
733409.56 x3
etc..
The data runs from approximately 01-Jan-2008 to 31-Dec-2010.
I want to separate the data into columns of monthly length.
For example the first column (January 2008) will comprise of the corresponding data values:
(first 01-Jan-2008 data value):(data value immediately preceding the first 01-Feb-2008 value)
Then the second column (February 2008):
(first 01-Feb-2008 data value):(data value immediately preceding the first 01-Mar-2008 value)
et cetera...
Some ideas I've been thinking of but don't know how to put together:
Convert all serial time numbers (e.g. 733408.33) to character strings with datestr
Use strmatch('01-January-2008',DatesInChars) to find the indices of the rows corresponding to 01-January-2008
Tricky part (?): TransformedData(:,i) = OriginalData(start:end) ? end = strmatch(1) - 1 and start = 1. Then change start at the end of the loop to strmatch(1) and then run step 2 again to find the next "starting index" and change end to the "new" strmatch(1)-1 ?
Having it speed optimized would be nice; I am going to apply it on data sampled ~2 million times.
Thanks!
I would use histc with a list a list of last days of the month as the second parameter (Note: use histc with the two return functions).
The edge list can easily be created with datenum or datevec.
This way you don't have operation on string and you that should be fast.
EDIT:
Example with result in a simple data structure (including some code from #Rody):
% Generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = tspan(:) + randn(size(tspan(:))); % add some noise so it's non-uniform
data = randn(size(tspan));
% Generate list of edge
edge = [];
for y = 2008:2010
for m = 1:12
edge = [edge datenum(y, m, 1)];
end
end
% Histogram
[number, bin] = histc(tspan, edge);
% Setup of result
result = {};
for n = 1:length(edge)
result{n} = [tspan(bin == n), data(bin == n)];
end
% Test
% 04-Aug-2008 17:25:20
datestr(result{8}(4,1))
tspan(data == result{8}(4,2))
datestr(tspan(data == result{8}(4,2)))
Assuming you have sorted, non-equally-spaced date numbers, the way to go here is to put the relevant data in a cell array, so that each entry corresponds to the next month, and can hold a different amount of elements.
Here's how to do that quite efficiently:
% generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = tspan(:) + randn(size(tspan(:))); % add some noise so it's non-uniform
data = randn(size(tspan));
% find month numbers
[~,M] = datevec(tspan);
% find indices where the month changes
inds = find(diff([0; M]));
% extract data in columns
sz = numel(inds)-1;
cols = cell(sz,1);
for ii = 1:sz-1
cols{ii} = data( inds(ii) : inds(ii+1)-1 );
end
Note that it can be difficult to determine which entry in cols belongs to which month, year, so here's how to do it in a more human-readable way:
% change this line:
[y,M] = datevec(tspan);
% and change these lines:
cols = cell(sz,3);
for ii = 1:sz-1
cols{ii,1} = data( inds(ii) : inds(ii+1)-1 );
% also store the year and month
cols{ii,2} = y(inds(ii));
cols{ii,3} = M(inds(ii));
end
I'll assume you have a timeVals an Nx1 double vector holding the time value of each datum. Assuming data is also an Nx1 array. I also assume data and timeVals are sorted according to time: that is, the samples you have are ordered according to the time they were taken.
How about:
subs = #(x,i) x(:,i);
months = subs( datevec(timeVals), 2 ); % extract the month of year as a number from the time
r = find( months ~= [months(2:end), months(end)+1] );
monthOfCell = months( r );
r( 2:end ) = r( 2:end ) - r( 1:end-1 );
dataByMonth = mat2cell( data', r ); % might need to transpose data or r here...
timeByMonth = mat2cell( timeVal', r );
After running this code, you have a cell array dataByMonth each cell contains all data relevant to a specific month. The corresponding cell of timeByMonth holds the sampling times of the data of the respective month. Finally, monthOfCell tells you what is the month's number (1-12) of each cell.