I have observed daily data that I need to compare to generated monthly data, so I need to get the mean of each month over the thirty-year period.
My observed data set is currently 365x31, with rows being each day (no leap years!) and the extra column being the month number (1-12).
The problem I am having is that I can only seem to get a script to compute the mean over all years, i.e. I cannot figure out how to get the script to do it for each column separately. An example of the data is below:
1 12 14
1 -15 10
2 13 3
2 2 37
...all the way to 12 for 365 rows
SO: to recap, I need to get the mean of [12; -15; 13; 2] then [14; 10; 3; 37] and so on.
I have been trying to use the unique() function to loop through the months, which picks out the right number of rows to average but gives incorrect means. Now I need it to handle each month (28-31 rows) and each column individually. The result should be a 12x30 matrix. I feel like I am missing something SIMPLE. Code:
u = unique(m); %get unique values of m (months) ie) 1-12
for i=1:length(u)
month(i) = mean(obatm(u(i), 2:31)); % the average for each month of each year
end
Appreciate any ideas! Thanks!
You can simply filter the rows for each month and then apply mean, like so:
month = zeros(12, size(obatm, 2) - 1); % 12x30: one row per month, one column per year
for k = 1:12
month(k, :) = mean(obatm(obatm(:, 1) == k, 2:end), 1); % skip the month column
end
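As a quick sanity check, the same loop can be run on the four sample rows from the question. This is only a sketch: `obatm_demo`, `month_demo` and `nMonths` are made-up names, the toy data contains only months 1 and 2, and the month column is left out of the averages.

```matlab
% Four sample rows from the question: [month, year-1 value, year-2 value]
obatm_demo = [1  12 14;
              1 -15 10;
              2  13  3;
              2   2 37];

nMonths = 2;                                   % only months 1 and 2 in this toy data
month_demo = zeros(nMonths, size(obatm_demo, 2) - 1);
for k = 1:nMonths
    rows = obatm_demo(:, 1) == k;              % logical mask selecting month k
    month_demo(k, :) = mean(obatm_demo(rows, 2:end), 1);  % column-wise mean
end
disp(month_demo)
```

Each row of `month_demo` holds one month and each column one year, which is the layout the question asks for.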
EDIT:
If you want something fancy, you can also do this:
cols = size(obatm, 2) - 1;
subs = bsxfun(@plus, obatm(:, 1), (0:12:12 * (cols - 1)));
vals = obatm(:, 2:end);
month = reshape(accumarray(subs(:), vals(:), [12 * cols, 1], @mean), 12, cols)
Look, Ma, no loops!
Related
I have a 35136-by-1 matrix containing power data for 366 days, with every day having 96 measurements. I want to take samples from 252 days (power data of "day 1 to day 7" is the first sample, power data of "day 2 to day 8" is the second sample, etc.) and reshape my matrix to size [96 7 1 252].
I wrote the following code, but I get 36 samples instead of 252:
m=7;
for j=1
sample([j:96*m],:)=solarpower_n([j:96*m],:);
y([(96*m)+1:96*(m+1)],:)=solarpower_n([(96*m)+1:96*(m+1)],:);
m=m+1;
for j=2:246
sample([(96*(j-1))+1:96*m],:)=solarpower_n([(96*(j-1))+1:96*m],:);
y([(96*m)+1:96*(m+1)],:)=solarpower_n([(96*m)+1:96*(m+1)],:);
m=m+1;
end
end
I want to take sample from each 7 days. Assume D to be the number of days, and M as number of power measurements on each day. For 252 days, M=[1,2,3,...,96] and D=[1,2,...,252] . Thus the power of first day, P1, has a dimension of 96*1. I want to take sample1={P1,...,P7}, sample2={P2,...,P8} , .....,sample252={P246,.....,P252}. and have a [96 7 1 252] 4-D array.
How can I accomplish this?
Taking samples that way is rather inefficient, since you're copying each data point 7 times. You could simply use indexing:
A = rand(96*366, 1); % Sample data
B = reshape(A,[96 366]); % Reshape all your days in one go
B(:, 1:7) % first 7 days
B(:, 163:169) % Days 163 to 169 (seven days), etc.
If you do want to copy your data seven times to your 4D array you can use a simple for loop:
A = rand(96*366, 1); % Sample data
% Note you need days 253:258 as well, since sample 252 spans days 252 to 258
B = reshape(A(1:96*(252+6)),[96 (252+6)]); % Reshape your first 258 days
C = zeros(size(B,1), 7, 1, size(B,2)-6); % Initialise output
for ii = 1:size(B,2)-6
C(:, :, :, ii) = B(:, ii:ii+6); % Save each 7 day sample
end
Getting rid of the for loop is difficult, given you want a sliding window. There are probably specialised functions for that somewhere, but given your data size a loop should be sufficiently performant.
For a short introduction on reshape() you can read this answer of mine.
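For what it's worth, the sliding window can also be built without a loop by indexing with a matrix of day numbers. The following is only a sketch on toy sizes; `M`, `D`, `W`, `nWin`, `idx` and `Cloop` are names introduced here for illustration, and the loop version is kept as a reference to compare against.

```matlab
% Toy sizes (assumptions): 4 measurements per day, 10 days, 7-day windows
M = 4; D = 10; W = 7;
A = (1:M*D)';                       % stand-in for the power data
B = reshape(A, [M D]);              % one column per day

nWin = D - W + 1;                   % number of complete windows
% idx(d, w) is the day number of the d-th day in window w
idx = bsxfun(@plus, (0:W-1)', 1:nWin);
C = reshape(B(:, idx(:)), [M W 1 nWin]);

% Reference result from the explicit loop
Cloop = zeros(M, W, 1, nWin);
for ii = 1:nWin
    Cloop(:, :, 1, ii) = B(:, ii:ii+W-1);
end
disp(isequal(C, Cloop))
```

This still copies every data point up to seven times, so it mainly saves typing, not memory.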
I have three vectors of the same size, pressure, year and month. Basically I would like to create a matrix of pressure values that correspond with the months and years that they were measured using a for loop. It should be 12x100 in order to appear as 12 months going down and 100 years going left to right.
I am just unsure of how to actually create the matrix, besides creating the initial structure. So far I can only find pressure for a single month (below I did January) for all years.
A = zeros([12, 100]);
for some_years = 1900:2000
press = pressure(year == some_years & month == 1)
end
And I can only print the pressures for January for all years, but I would like to store all pressures for all months of the years in a matrix. If anyone can help it would be greatly appreciated. Thank You.
Starting with variables pressure, year, and month. I would do something like:
A fairly robust solution using for loops:
T = length(pressure); % get number of time periods. I will assume vectors same length
if(length(year) ~= T || length(month) ~= T)
error('length mismatch');
end
min_year = min(year); % this year will correspond to index 1
max_year = max(year);
A = NaN(max_year - min_year + 1, 12); % I like to initialize to NaN (not a number)
% this way missing values are NaN
for i=1:T
year_index = year(i) - min_year + 1;
month_index = month(i); % Im assuming months run from 1 to 12
A(year_index, month_index) = pressure(i);
end
If your data is SUPER nicely formatted....
If your data has NO missing, duplicate, or out of order year month pairs (i.e. data is formatted like):
year month pressure
1900 1 ...
1900 2 ...
... ... ...
1900 12 ...
1901 1 ...
... ... ...
Then you could do the ONE liner:
A = reshape(pressure, 12, max(year) - min(year) + 1)';
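If the data is not that tidy, a loop-free middle ground is accumarray, which places each value by its (year, month) subscript. This is a sketch on made-up toy values, not part of the answer above; note that, unlike the loop (which keeps the last value written), it averages duplicate year-month pairs.

```matlab
% Toy data (assumption): three years, most year-month pairs missing
year     = [1900; 1900; 1901; 1902; 1902];
month    = [   1;    2;    1;    1;   12];
pressure = [1010; 1012; 1008; 1015; 1001];

nYears = max(year) - min(year) + 1;
subs   = [year(:) - min(year) + 1, month(:)];   % [row, column] for every value

% Average duplicates; pairs without data are filled with NaN
A = accumarray(subs, pressure(:), [nYears 12], @mean, NaN);

A12 = A';   % transpose if you want months down and years across
```

As in the loop version, rows are years and columns are months, so the transpose gives the 12-by-years layout from the question.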
Suppose I have a date vector tt and a corresponding data series aa. For example:
dd = datestr(datenum('2007-01-01 00:00','yyyy-mm-dd HH:MM'):1/24:...
datenum('2011-12-31 23:00','yyyy-mm-dd HH:MM'),...
'yyyy-mm-dd HH:MM');
tt = datevec(datenum(dd,'yyyy-mm-dd HH:MM'));
tt(1002,:) = [];
aa = rand(length(tt),1)
How is it possible to ensure that the hours and days are consistent among the years?
For example, I only want to keep times that are the same among years e.g.
2009-01-01 01:00
would be the same as
2010-01-01 01:00
and so on.
If one year has a measurements at
2009-01-01 02:00
but yyyy-01-01 02:00
is not present in the other years, this time should be removed.
I would like to return tt and aa where only those times that are consistent among the years are kept. How can this be done?
I was considering finding the indices for the unique years first as:
[~,~,iyears] = unique(tt(:,1),'rows');
and then find the indices for the unique month, day, and hour as:
[~,~,iid] = unique(tt(:,2:4),'rows');
but I am not sure how to combine these to give the desired output?
The solution below uses a loop to store data in an uninitialized array, which can be inefficient, but unless your dataset is huge (with many, many years) it should do the job. The general idea is to break the dataset up into years. I am storing the resulting time vectors in a cell array, because they probably won't have the same length. I then do a set intersection of all the time vectors to get a vector of common times. From there it's straightforward.
years = unique(tt(:,1), 'rows');
% Put the "sub-times" of each year into cell array
for ii = 1:length(years)
times_each_year{ii} = tt(tt(:,1)==years(ii),2:end);
end
% Do intersection of all "sub-times" sets
common_times = times_each_year{1};
for ii = 2:length(years)
common_times = intersect(common_times, times_each_year{ii},'rows');
end
% Find and delete the points that are not member of the "sub-times":
idx = ~ismember(tt(:,2:end),common_times,'rows');
deleted_points = datestr(tt(idx,:)); % for later review
tt(idx,:) = [];
aa(idx,:) = []; % keep the data series aligned with tt
However, note that the deleted_points vector contains more points than one might expect. That's because 2008 was a leap year, and all the points corresponding to Febr. 29th were deleted.
Another such oddity might await you if your data is "contaminated" by daylight savings time.
Code
a1 = str2num(datestr(tt,'mmddHHMM')); %// If in your data minutes are always 00, you can use 'mmddHH' instead and save some runtime
k1 = unique(a1);
gt1 = histc(a1,k1);
valid_rows = ismember(a1,k1(gt1==max(gt1)));
new_tt = tt(valid_rows,:); %// Desired tt output
new_aa = aa(valid_rows,:); %// Desired aa output
Explanation
To understand how it works, let's test out the code at a micro-level. Let's assume some small data that corresponds to tt -
data1 = [4 5 1 4 5 1 4 5 6]
data1 is data collected over a few sets. It resembles tt, which holds data over a few years, once the month, date, hour and minutes are combined into a single parameter.
One can notice that it represents data from three sets/years: {4,5}, {1,4,5} and {1,4,5,6}. Our job is to find all those values in data1 that are repeated across all three years/sets of data. Thus, the final output must be {4,5}.
Let's see how this can be coded up.
Step 1: Get the unique values
unique_val = unique(data1)
We would have - [1 4 5 6]
Step 2: Get the count of unique values in the data
count_unique_val = histc(data1,unique_val)
Output is - [2 3 3 1]
Step 3: Get the indices from the unique values array where their counts are equal to the maximum of counts, indicating those are the unique values that are repeated across all the sets.
index1 = count_unique_val==max(count_unique_val)
Output comes out as - [0 1 1 0]
Step 4: Get those "consistent" unique values
consistent_val = unique_val(index1)
Gives us - [4 5], which is what we were looking for.
Step 5: Finally get the indices where the consistent data is present,
which can be used later on to select the rows with "consistent" data.
index_consistent_val = ismember(data1,consistent_val)
Output is - [1 1 0 1 1 0 1 1 0], which makes sense too.
Please note that in the original code a1 = str2num(datestr(tt,'mmddHHMM')); gets us the single parameter from the four parameters of month, date, hour and minutes as discussed in the comments earlier too.
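The five steps can be collected into one runnable snippet. As a small deviation from the steps above, this sketch counts occurrences with the third output of unique plus accumarray instead of histc; that swap is my own substitution, and the `pos` variable is a name introduced here.

```matlab
data1 = [4 5 1 4 5 1 4 5 6];

% Steps 1-2: unique values and how often each one occurs
[unique_val, ~, pos] = unique(data1);        % pos maps each entry to unique_val
count_unique_val = accumarray(pos(:), 1)';   % occurrences of each unique value

% Steps 3-4: keep the values that occur the maximum number of times
index1 = count_unique_val == max(count_unique_val);
consistent_val = unique_val(index1);

% Step 5: mark the positions in data1 that hold a consistent value
index_consistent_val = ismember(data1, consistent_val);
```

Like the original code, this relies on at least one value being present in every set, so that the maximum count equals the number of sets.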
So I have a matrix Data in this format:
Data = [Date Time Price]
Now what I want to do is plot the Price against the Time, but my data is very large and has lines where there are multiple Prices for the same Date/Time, e.g. 1st, 2nd lines
29 733575.459548611 40.0500000000000
29 733575.459548611 40.0600000000000
29 733575.459548612 40.1200000000000
29 733575.45954862 40.0500000000000
I want to take an average of the prices with the same Date/Time and get rid of any extra lines. My goal is to do linear interpolation on the values, which is why I must have only one Price value per Time.
How can I do this? I did this (this reduces the matrix so that it only takes the first line for the lines with repeated date/times) but I don't know how to take the average
function [ C ] = test( DN )
[Qrows, cols] = size(DN);
C = DN(1,:);
for i = 1:(Qrows-1)
if DN(i,2) == DN(i+1,2)
%n = 1;
%while DN(i,2) == DN(i+n,2) && i+n<Qrows
% n = n + 1;
%end
% somehow take average;
else
C = [C;DN(i+1,:)];
end
end
[C,ia,ic] = unique(A,'rows') also returns index vectors ia and ic
such that C = A(ia,:) and A = C(ic,:)
If you pass as input A only the columns you do not want to average over (here: date and time), ic holds one value for every row, and rows you want to combine share the same value.
Getting from there to the means you want is probably more intuitive for MATLAB beginners with a for loop: using logical indexing, e.g. DN(ic==n,3), you get a vector of all values you want to average (where n is the index of the date-time row they belong to). You need to do this for all the different date-time combinations.
A more vector-oriented way would be to use accumarray, which leads to a solution of your problem in two lines:
[DateAndTime,~,idx] = unique(DN(:,1:2),'rows');
Price = accumarray(idx,DN(:,3),[],@mean);
I'm not quite sure what you want the result to look like, but [DateAndTime Price] gives you the three-column format of the input again.
Note that if your input contains something like:
1 0.1 23
1 0.2 47
1 0.1 42
1 0.1 23
then applying unique(...,'rows') to the input before the above lines will give a different result for 1 0.1 than using the above lines directly: the direct version calculates the mean of 23, 23 and 42, whereas with unique applied first one 23 is eliminated as a duplicate, so the differing row with 42 has a greater weight in the average.
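The difference is easy to see by running both variants on those four rows. This is a small sketch; `DN_demo`, `priceDirect`, `DNu` and `priceDedup` are names made up for the illustration.

```matlab
DN_demo = [1 0.1 23;
           1 0.2 47;
           1 0.1 42;
           1 0.1 23];

% Direct: every input row counts, so 23 is weighted twice
[~, ~, idx] = unique(DN_demo(:, 1:2), 'rows');
priceDirect = accumarray(idx, DN_demo(:, 3), [], @mean);

% Deduplicate first: the two identical rows collapse into one
DNu = unique(DN_demo, 'rows');
[~, ~, idxU] = unique(DNu(:, 1:2), 'rows');
priceDedup = accumarray(idxU, DNu(:, 3), [], @mean);

disp([priceDirect priceDedup])
```

For the 1 0.1 group the direct version averages 23, 42 and 23, while the deduplicated version averages only 23 and 42.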
Try the following:
[Qrows, cols] = size(DN);
% C is your result matrix
C = DN;
% this will give you the indexes where DN(i,2)==DN(i+1,2)
i = find(diff(DN(:,2))==0);
% replace C(i,:) with the average
C(i,:) = (DN(i,:)+DN(i+1,:))/2;
% delete the C(i+1,:) rows
C(i+1,:) = [];
Hope this works.
This should work if the repeated time values come in pairs (the average is calculated between i and i+1). Should you have time repeats of 3 or more then try to rethink how to change these steps.
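One possible way to handle runs of three or more repeats is to number the runs with cumsum and average per run. This is only a sketch, and it assumes the rows are already sorted by time; `DN_demo`, `isFirst`, `grp` and `avg` are names introduced here.

```matlab
% Toy data (assumption): [Date Time Price], sorted by time, with a run of three
DN_demo = [29 0.10 40;
           29 0.10 42;
           29 0.10 44;
           29 0.20 50;
           29 0.30 60;
           29 0.30 62];

isFirst = [true; diff(DN_demo(:, 2)) ~= 0];   % first row of each run of equal times
grp     = cumsum(isFirst);                    % run number for every row
avg     = accumarray(grp, DN_demo(:, 3), [], @mean);  % mean price per run

C = [DN_demo(isFirst, 1:2), avg];             % one row per distinct time
disp(C)
```

Because every run gets its own group number, the length of a run no longer matters.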
Something like this would work, but I did not run the code so I can't promise there's no bugs.
newX = unique(DN(:,2));
newY = zeros(1,length(newX));
for ix = 1:length(newX)
allOcurrences = find(DN(:,2)==newX(ix)); % rows sharing this time value
% If there's duplicates, take their mean
if numel(allOcurrences)>1
newY(ix) = mean(DN(allOcurrences,3));
else
% If not, use the only Y value
newY(ix) = DN(allOcurrences,3);
end
end
I have 19 cells (19x1) with temperature data for an entire year where the first 18 cells represent 20 days (each) and the last cell represents 5 days, hence (18*20)+5 = 365days.
In each cell there should be 7200 measurements (apart from cell 19) where each measurement is taken every 4 minutes thus 360 measurements per day (360*20 = 7200).
The time vector for the measurements is only expressed as day number i.e. 1,2,3...and so on (thus no decimal day),
which is therefore displayed as 360 x 1's... and so on.
As the sensor failed during some days, some of the cells contain less than 7200 measurements, where one in
particular only contains 858 rows, which looks similar to the following example:
a=rand(858,3);
a(1:281,1)=1;
a(281:327,1)=2;
a(327:328,1)=5;
a(329:330,1)=9;
a(331:498,1)=19;
a(499:858,1)=20;
Where column 1 = day, column 2 and 3 are the data.
By knowing that each day number should be repeated 360 times is there a method for including an additional
amount of every value from 1:20 in order to make up the 360. For example, the first column requires
79 x 1's, 46 x 2's, 360 x 3's... and so on; where the final array should therefore have 7200 values in
order from 1 to 20.
If this is possible, in the rows where these values have been added, the second and third column should
changed to nan.
I realise that this is an unusual question, and that it is difficult to understand what is asked, but I hope I have been clear in expressing what I'm attempting to
achieve. Any advice would be much appreciated.
Here's one way to do it for a given element of the cell matrix:
full=zeros(7200,3)+NaN;
for i = 1:20 % for each day
starti = (i-1)*360; % find corresponding 360 indices into full array
full( starti + (1:360), 1 ) = i; % assign the day
idx = find(a(:,1)==i); % find any matching data in a for that day
full( starti + (1:length(idx)), 2:3 ) = a(idx,2:3); % copy matching data over
end
You could probably use arrayfun to make this slicker, and maybe (??) faster.
You could make this into a function and use cellfun to apply it to your cell.
PS - if you ask your question at the Matlab help forums you'll most definitely get a slicker & more efficient answer than this. Probably involving bsxfun or arrayfun or accumarray or something like that.
Update - to do this for each element in the cell array, the only change is that instead of searching for i as the day number you calculate it based on how far along the cell array you are. You'd do something like (untested):
for k = 1:length(cellarray)
a = cellarray{k}; % data for this block of 20 days
for i = 1:20 % day within the block (the last cell covers only 5 days)
starti = (i-1)*360; % ... as before
day = (k-1)*20 + i; % first cell is days 1-20, second is 21-40,...
full( starti + (1:360),1 ) = day; % <-- replace i with day
idx = find(a(:,1)==day); % <-- replace i with day
full( starti + (1:length(idx)), 2:3 ) = a(idx,2:3); % same as before
end
% save full for cell k here and reset it to NaN before the next k
end
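For reference, a loop-free version of the single-cell case might look like the sketch below. It assumes the rows are sorted by day, that day numbers run from 1 to nDays, and that no day has more rows than expected; `a_demo`, `nDays`, `perDay`, `padded` and the helper variables are all made up for this illustration, and the sizes are shrunk to 3 days of 4 measurements.

```matlab
% Toy sizes (assumptions): 3 days, 4 measurements expected per day
nDays = 3; perDay = 4;
a_demo = [1 0.1 0.2;
          1 0.3 0.4;
          3 0.5 0.6];                    % day 2 is missing entirely

days  = a_demo(:, 1);
cnt   = accumarray(days, 1, [nDays 1]);          % rows present for each day
first = cumsum([0; cnt(1:end-1)]);               % rows that precede each day
pos   = (1:numel(days))' - first(days);          % position of each row within its day
rowIx = (days - 1) * perDay + pos;               % target row in the padded array

padded = NaN(nDays * perDay, 3);
padded(:, 1) = kron((1:nDays)', ones(perDay, 1)); % day column: 1 1 1 1 2 2 2 2 ...
padded(rowIx, 2:3) = a_demo(:, 2:3);              % existing data; the rest stays NaN
disp(padded)
```

With the real data, nDays would be 20 and perDay 360, giving the 7200x3 array.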
I am not sure I understood correctly what you want to do, but the code below works out how many measurements are missing for each day and appends extra rows at the bottom of your 'a' matrix, so that you get the full 7200x3 matrix.
nbMissing = 7200-size(a,1);
a1 = nan(nbMissing,3);
l = 0;
for i = 1:20
nbMissing_i = 360-sum(a(:,1)==i);
a1(l+1:l+nbMissing_i,1) = i;
l = l+nbMissing_i;
end
a_filled = [a;a1];