I have a matrix in MATLAB of 50572x4 doubles. The last column has datenum format dates, increasing values from 7.3025e+05 to 7.3139e+05. The question is:
How can I split this matrix into sub-matrices, each that cover intervals of 30 days?
If I'm not being clear enough… the difference between the first element in the 4th column and the last element in the 4th column is 7.3139e5 − 7.3025e5 = 1.1376e3, or 1137.6. I would like to partition this into 30 day segments, and get a bunch of matrices that have a range of 30 for the 4th columns. I'm not quite sure how to go about doing this...I'm quite new to MATLAB, but the dataset I'm working with has only this representation, necessitating such an action.
Note that a unit interval between datenum timestamps represents 1 day, so your data, in fact, covers a time period of 1137.6 days). The straightforward approach is to compare each timestamps with the edges in order to determine which 30-day interval it belongs to:
t = A(:, end) - min(A:, end); %// Normalize timestamps to start from 0
idx = sum(bsxfun(#lt, t, 30:30:max(t))); %// Starting indices of intervals
rows = diff([0, idx, numel(t)]); %// Number of rows in each interval
where A is your data matrix, where the last column is assumed to contain the timestamps. rows stores the number of rows of the corresponding 30-day intervals. Finally, you can employ cell arrays to split the original data matrix:
C = mat2cell(A, rows, size(A, 2)); %// Split matrix into intervals
C = C(~cellfun('isempty', C)); %// Remove empty matrices
Hope it helps!
Well, all you need is to find the edge times and the matrix indexes in between them. So, if your numbers are at datenum format, one unit is the same as one day, which means that we can jump from 30 and 30 units until we get as close as we can to the end, as follows:
startTime = originalMatrix(1,4);
endTime = originalMatrix(end,4);
edgeTimes = startTime:30:endTime;
% And then loop though the edges checking for samples that complete a cycle:
nEdges = numel(edgeTimes);
totalMeasures = size(originalMatrix,1);
subMatrixes = cell(1,nEdges);
prevEdgeIdx = 0;
for curEdgeIdx = 1:nEdges
nearIdx=getNearestIdx(originalMatrix(:,4),edgeTimes(curEdgeIdx));
if originalMatrix(nearIdx,4)>edgeTimes(curEdgeIdx)
nearIdx = nearIdx-1;
end
if nearIdx>0 && nearIdx<=totalMeasures
subMatrix{curEdgeIdx} = originalMatrix(prevEdgeIdx+1:curEdgeIdx,:);
prevEdgeIdx=curEdgeIdx;
else
error('For some reason the edge was not inbound.');
end
end
% Now we check for the remaining days after the edges which does not complete a 30 day cycle:
if curEdgeIdx<totalMeasures
subMatrix{end+1} = originalMatrix(curEdgeIdx+1:end,:);
end
The function getNearestIdx was discussed here and it gives you the nearest point from the input values without checking all possible points.
function vIdx = getNearestIdx(values,point)
if isempty(values) || ~numel(values)
vIdx = [];
return
end
vIdx = 1+round((point-values(1))*(numel(values)-1)...
/(values(end)-values(1)));
if vIdx < 1, vIdx = []; end
if vIdx > numel(values), vIdx = []; end
end
Note: This is pseudocode and may contain errors. Please try to adjust it into your problem.
Related
I have several vectors I would like to concatenate, where each element is an increasing timestamp, but how do I concatenate the vectors while ensuring to have an continuously time scale?
Say, I have two vectors tone1_time and tone2_time both given by 1x4801 double. Each element of the vector contain a timestamps and thus the elements must be added when the vectors are concatenated in order to have the correct time. So far I have;
n = 10;
for i = 1:n
time(n,end) = tone1_time + tone2_time;
end
Which generates an error in matlab!
EDIT: More code
I generate two sounds vectors and concatenate them by:
% repeat n times
n = 10;
signal = [ tone1_signal tone2_signal ];
signal = repmat(signal,1,n);
This will e.g. return a new vector signal with a length of e.g. 1x48020 double. The time vector needs to have the same size as this vector, but also still having an continuously time.
first, you need to add the last element of tone1_time to all elements of tone2_time to ensure continuity of the time intervals:
tone2_time = tone2_time + tone1_time(end);
Then you can concat them
tone_time = [tone1_time, tone2_time];
Alternatively, you can work with the differences
tone_time = cumsum([diff([0 tone1_time]), diff([0 tone2_time])]);
EDIT:
For replicating the time vector:
tone_time_diff = [diff([0 tone1_time]), diff([0 tone2_time])];
tone_time = cumsum( repmat(tone_time_diff, 1, n) );
If I have one year worth of data
Jday = datenum('2010-01-01 00:00','yyyy-mm-dd HH:MM'):1/24:...
datenum('2010-12-31 23:00','yyyy-mm-dd HH:MM');
dat = rand(length(Jday),1);
and would like to calculate the monthly averages of 'dat', I would use:
% monthly averages
dateV = datevec(Jday);
[~,~,b] = unique(dateV(:,1:2),'rows');
monthly_av = accumarray(b,dat,[],#nanmean);
I would, however, like to calculate the monthly averages for the points that occur during the day i.e. between hours 6 and 18, how can this be done?
I can isolate the hours I wish to use in the monthly averages:
idx = dateV(:,4) >= 6 & dateV(:,4) <= 18;
and can then change 'b' to include only these points by:
b(double(idx) == 0) = 0;
and then calculate the averages by
monthly_av_new = accumarray(b,dat,[],#nanmean);
but this doesn't work because accumarray can only work with positive integers thus I get an error
Error using accumarray
First input SUBS must contain positive integer subscripts.
What would be the best way of doing what I've outlined? Keep in mind that I do not want to alter the variable 'dat' when doing this i.e. remove some values from 'dat' prior to calculating the averages.
Thinking about it, would the best solution be
monthly_av = accumarray(b(idx),dat(idx),[],#nanmean);
You almost have it. Just use logical indexing with idx in b and in dat:
monthly_av_new = accumarray(b(idx),dat(idx),[],#nanmean);
(and the line b(double(idx) == 0) = 0; is no longer needed).
This way, b(idx) contains only the indices corresponding to your desired hour interval, and data(idx) contains the corresponding values.
EDIT: Now I see you already found the solution! Yes, I think it's the best approach.
so I have a matrix Data in this format:
Data = [Date Time Price]
Now what I want to do is plot the Price against the Time, but my data is very large and has lines where there are multiple Prices for the same Date/Time, e.g. 1st, 2nd lines
29 733575.459548611 40.0500000000000
29 733575.459548611 40.0600000000000
29 733575.459548612 40.1200000000000
29 733575.45954862 40.0500000000000
I want to take an average of the prices with the same Date/Time and get rid of any extra lines. My goal is to do linear intrapolation on the values which is why I must have only one Time to one Price value.
How can I do this? I did this (this reduces the matrix so that it only takes the first line for the lines with repeated date/times) but I don't know how to take the average
function [ C ] = test( DN )
[Qrows, cols] = size(DN);
C = DN(1,:);
for i = 1:(Qrows-1)
if DN(i,2) == DN(i+1,2)
%n = 1;
%while DN(i,2) == DN(i+n,2) && i+n<Qrows
% n = n + 1;
%end
% somehow take average;
else
C = [C;DN(i+1,:)];
end
end
[C,ia,ic] = unique(A,'rows') also returns index vectors ia and ic
such that C = A(ia,:) and A = C(ic,:)
If you use as input A only the columns you do not want to average over (here: date & time), ic with one value for every row where rows you want to combine have the same value.
Getting from there to the means you want is for MATLAB beginners probably more intuitive with a for loop: Use logical indexing, e.g. DN(ic==n,3) you get a vector of all values you want to average (where n is the index of the date-time-row it belongs to). This you need to do for all different date-time-combinations.
A more vector-oriented way would be to use accumarray, which leads to a solution of your problem in two lines:
[DateAndTime,~,idx] = unique(DN(:,1:2),'rows');
Price = accumarray(idx,DN(:,3),[],#mean);
I'm not quite sure how you want the result to look like, but [DataAndTime Price] gives you the three-row format of the input again.
Note that if your input contains something like:
1 0.1 23
1 0.2 47
1 0.1 42
1 0.1 23
then the result of applying unique(...,'rows') to the input before the above lines will give a different result for 1 0.1 than using the above directly, as the latter would calculate the mean of 23, 23 and 42, while in the former case one 23 would be eliminates as duplicate before and the differing row with 42 would have a greater weight in the average.
Try the following:
[Qrows, cols] = size(DN);
% C is your result matrix
C = DN;
% this will give you the indexes where DN(i,:)==DN(i+1)
i = find(diff(DN(:,2)==0);
% replace C(i,:) with the average
C(i,:) = (DN(i,:)+DN(i+1,:))/2;
% delete the C(i+1,:) rows
C(i,:) = [];
Hope this works.
This should work if the repeated time values come in pairs (the average is calculated between i and i+1). Should you have time repeats of 3 or more then try to rethink how to change these steps.
Something like this would work, but I did not run the code so I can't promise there's no bugs.
newX = unique(DN(:,2));
newY = zeros(1,length(newX));
for ix = 1:length(newX)
allOcurrences = find(DN(:,2)==DN(i,2));
% If there's duplicates, take their mean
if numel(allOcurrences)>1
newY(ix) = mean(DN(allOcurrences,3));
else
% If not, use the only Y value
newY(ix) = DN(ix,3);
end
end
I have 19 cells (19x1) with temperature data for an entire year where the first 18 cells represent 20 days (each) and the last cell represents 5 days, hence (18*20)+5 = 365days.
In each cell there should be 7200 measurements (apart from cell 19) where each measurement is taken every 4 minutes thus 360 measurements per day (360*20 = 7200).
The time vector for the measurements is only expressed as day number i.e. 1,2,3...and so on (thus no decimal day),
which is therefore displayed as 360 x 1's... and so on.
As the sensor failed during some days, some of the cells contain less than 7200 measurements, where one in
particular only contains 858 rows, which looks similar to the following example:
a=rand(858,3);
a(1:281,1)=1;
a(281:327,1)=2;
a(327:328,1)=5;
a(329:330,1)=9;
a(331:498,1)=19;
a(499:858,1)=20;
Where column 1 = day, column 2 and 3 are the data.
By knowing that each day number should be repeated 360 times is there a method for including an additional
amount of every value from 1:20 in order to make up the 360. For example, the first column requires
79 x 1's, 46 x 2's, 360 x 3's... and so on; where the final array should therefore have 7200 values in
order from 1 to 20.
If this is possible, in the rows where these values have been added, the second and third column should
changed to nan.
I realise that this is an unusual question, and that it is difficult to understand what is asked, but I hope I have been clear in expressing what i'm attempting to
acheive. Any advice would be much appreciated.
Here's one way to do it for a given element of the cell matrix:
full=zeros(7200,3)+NaN;
for i = 1:20 % for each day
starti = (i-1)*360; % find corresponding 360 indices into full array
full( starti + (1:360), 1 ) = i; % assign the day
idx = find(a(:,1)==i); % find any matching data in a for that day
full( starti + (1:length(idx)), 2:3 ) = a(idx,2:3); % copy matching data over
end
You could probably use arrayfun to make this slicker, and maybe (??) faster.
You could make this into a function and use cellfun to apply it to your cell.
PS - if you ask your question at the Matlab help forums you'll most definitely get a slicker & more efficient answer than this. Probably involving bsxfun or arrayfun or accumarray or something like that.
Update - to do this for each element in the cell array the only change is that instead of searching for i as the day number you calculate it based on how far allong the cell array you are. You'd do something like (untested):
for k = 1:length(cellarray)
for i = 1:length(cellarray{k})
starti = (i-1)*360; % ... as before
day = (k-1)*20 + i; % first cell is days 1-20, second is 21-40,...
full( starti + (1:360),1 ) = day; % <-- replace i with day
idx = find(a(:,1)==day); % <-- replace i with day
full( starti + (1:length(idx)), 2:3 ) = a(idx,2:3); % same as before
end
end
I am not sure I understood correctly what you want to do but this below works out how many measurements you are missing for each day and add at the bottom of your 'a' matrix additional lines so you do get the full 7200x3 matrix.
nbMissing = 7200-size(a,1);
a1 = nan(nbmissing,3)
l=0
for i = 1:20
nbMissing_i = 360-sum(a(:,1)=i);
a1(l+1:l+nbMissing_i,1)=i;
l = l+nb_Missing_i;
end
a_filled = [a;a1];
I have a bunch of times-series each described by two components, a timestamp vector (in seconds), and a vector of values measured. The time vector is non-uniform (i.e. sampled at non-regular intervals)
I am trying to compute the mean/SD of each 1-minutes interval of values (take X minute interval, compute its mean, take the next interval, ...).
My current implementation uses loops. This is a sample of what I have so far:
t = (100:999)' + rand(900,1); %' non-uniform time
x = 5*rand(900,1) + 10; % x(i) is the value at time t(i)
interval = 1; % 1-min interval
tt = ( floor(t(1)):interval*60:ceil(t(end)) )'; %' stopping points of each interval
N = length(tt)-1;
mu = zeros(N,1);
sd = zeros(N,1);
for i=1:N
indices = ( tt(i) <= t & t < tt(i+1) ); % find t between tt(i) and tt(i+1)
mu(i) = mean( x(indices) );
sd(i) = std( x(indices) );
end
I am wondering if there a faster vectorized solution. This is important because I have a large number of time-series to process each much longer than the sample shown above..
Any help is welcome.
Thank you all for the feedback.
I corrected the way t is generated to be always monotonically increasing (sorted), this was not really an issue..
Also, I may not have stated this clearly but my intention was to have a solution for any interval length in minutes (1-min was just an example)
The only logical solution seems to be...
Ok. I find it funny that to me there is only one logical solution, but many others find other solutions. Regardless, the solution does seem simple. Given the vectors x and t, and a set of equally spaced break points tt,
t = sort((100:999)' + 3*rand(900,1)); % non-uniform time
x = 5*rand(900,1) + 10; % x(i) is the value at time t(i)
tt = ( floor(t(1)):1*60:ceil(t(end)) )';
(Note that I sorted t above.)
I would do this in three fully vectorized lines of code. First, if the breaks were arbitrary and potentially unequal in spacing, I would use histc to determine which intervals the data series falls in. Given they are uniform, just do this:
int = 1 + floor((t - t(1))/60);
Again, if the elements of t were not known to be sorted, I would have used min(t) instead of t(1). Having done that, use accumarray to reduce the results into a mean and standard deviation.
mu = accumarray(int,x,[],#mean);
sd = accumarray(int,x,[],#std);
You could try and create a cell array and apply mean and std via cellfun. It's ~10% slower than your solution for 900 entries, but ~10x faster for 90000 entries.
[t,sortIdx]=sort(t); %# we only need to sort in case t is not monotonously increasing
x = x(sortIdx);
tIdx = floor(t/60); %# convert seconds to minutes - can also convert to 5 mins by dividing by 300
tIdx = tIdx - min(tIdx) + 1; %# tIdx now is a vector of indices - i.e. it starts at 1, and should go like your iteration variable.
%# the next few commands are to count how many 1's 2's 3's etc are in tIdx
dt = [tIdx(2:end)-tIdx(1:end-1);1];
stepIdx = [0;find(dt>0)];
nIdx = stepIdx(2:end) - stepIdx(1:end-1); %# number of times each index appears
%# convert to cell array
xCell = mat2cell(x,nIdx,1);
%# use cellfun to calculate the mean and sd
mu(tIdx(stepIdx+1)) = cellfun(#mean,xCell); %# the indexing is like that since there may be missing steps
sd(tIdx(stepIdx+1)) = cellfun(#mean,xCell);
Note: my solution does not give the exact same results as yours, since you skip a few time values at the end (1:60:90 is [1,61]), and since the start of the interval is not exactly the same.
Here's a way that uses binary search. It is 6-10x faster for 9900 elements and about 64x times faster for 99900 elements. It was hard to get reliable times using only 900 elements so I'm not sure which is faster at that size. It uses almost no extra memory if you consider making tx directly from the generated data. Other than that it just has four extra float variables (prevind, first, mid, and last).
% Sort the data so that we can use binary search (takes O(N logN) time complexity).
tx = sortrows([t x]);
prevind = 1;
for i=1:N
% First do a binary search to find the end of this section
first = prevind;
last = length(tx);
while first ~= last
mid = floor((first+last)/2);
if tt(i+1) > tx(mid,1)
first = mid+1;
else
last = mid;
end;
end;
mu(i) = mean( tx(prevind:last-1,2) );
sd(i) = std( tx(prevind:last-1,2) );
prevind = last;
end;
It uses all of the variables that you had originally. I hope that it suits your needs. It is faster because it takes O(log N) to find the indices with binary search, but O(N) to find them the way you were doing it.
You can compute indices all at once using bsxfun:
indices = ( bsxfun(#ge, t, tt(1:end-1)') & bsxfun(#lt, t, tt(2:end)') );
This is faster than looping but requires storing them all at once (time vs space tradeoff)..
Disclaimer: I worked this out on paper, but haven't yet had the opportunity to check it "in silico"...
You may be able to avoid loops or using cell arrays by doing some tricky cumulative sums, indexing, and calculating the means and standard deviations yourself. Here's some code that I believe will work, although I am unsure how it stacks up speed-wise to the other solutions:
[t,sortIndex] = sort(t); %# Sort the time points
x = x(sortIndex); %# Sort the data values
interval = 60; %# Interval size, in seconds
intervalIndex = floor((t-t(1))./interval)+1; %# Collect t into intervals
nIntervals = max(intervalIndex); %# The number of intervals
mu = zeros(nIntervals,1); %# Preallocate mu
sd = zeros(nIntervals,1); %# Preallocate sd
sumIndex = [find(diff(intervalIndex)) ...
numel(intervalIndex)]; %# Find indices of the interval ends
n = diff([0 sumIndex]); %# Number of samples per interval
xSum = cumsum(x); %# Cumulative sum of x
xSum = diff([0 xSum(sumIndex)]); %# Sum per interval
xxSum = cumsum(x.^2); %# Cumulative sum of x^2
xxSum = diff([0 xxSum(sumIndex)]); %# Squared sum per interval
intervalIndex = intervalIndex(sumIndex); %# Find index into mu and sd
mu(intervalIndex) = xSum./n; %# Compute mean
sd(intervalIndex) = sqrt((xxSum-xSum.*xSum./n)./(n-1)); %# Compute std dev
The above computes the standard deviation using the simplification of the formula found on this Wikipedia page.
The same answer as above but with the parametric interval (window_size).
Issue with the vector lengths solved as well.
window_size = 60; % but it can be any value 60 5 0.1, which wasn't described above
t = sort((100:999)' + 3*rand(900,1)); % non-uniform time
x = 5*rand(900,1) + 10; % x(i) is the value at time t(i)
int = 1 + floor((t - t(1))/window_size);
tt = ( floor(t(1)):window_size:ceil(t(end)) )';
% mean val and std dev of the accelerations at speed
mu = accumarray(int,x,[],#mean);
sd = accumarray(int,x,[],#std);
%resolving some issue with sizes (for i.e. window_size = 1 in stead of 60)
while ( sum(size(tt) > size(mu)) > 0 )
tt(end)=[];
end
errorbar(tt,mu,sd);