correlation coefficient between cells - matlab

I have a dataset stored in a similar manner to the following example:
clear all
Year = cell(1,4);
Year{1} = {'Y2007','Y2008','Y2009','Y2010','Y2011'};
Year{2} = {'Y2005','Y2006','Y2007','Y2008','Y2009'};
Year{3} = {'Y2009','Y2010','Y2011'};
Year{4} = {'Y2007','Y2008','Y2009','Y2010','Y2011'};
data = cell(1,4);
data{1} = {rand(26,1),rand(26,1),rand(26,1),rand(26,1),rand(26,1)};
data{2} = {rand(26,1),rand(26,1),rand(26,1),rand(26,1),rand(26,1)};
data{3} = {rand(26,1),rand(26,1),rand(26,1)};
data{4} = {rand(26,1),rand(26,1),rand(26,1),rand(26,1),rand(26,1)};
Each cell in 'Year' lists the years in which the measurements in the corresponding cell of 'data' were collected. For example, the first cell in Year ('Year{1}') contains the year in which each measurement in 'data{1}' was collected, so data{1}{1} was collected in 'Y2007', data{1}{2} in 'Y2008', and so on.
I am now trying to find the correlation between each measurement and the corresponding (same year) measurement from the other locations. For example, for the year 'Y2007' I would like to find the correlation between data{1}{1} and data{2}{3}, then data{1}{1} and data{4}{1}, and then data{2}{3} and data{4}{1}, and so on for the remaining years.
I know that the corrcoef command should be used to calculate the correlation, but I cannot seem to get to the stage where this is possible. Any advice would be much appreciated.

I assume each year appears only once per cell. Here is the code I ended up with (see comments for explanations):
yu = unique([Year{:}]); %# cell array of unique year across all cells
cc = cell(size(yu)); %# cell array for each year
for y = 1:numel(yu)
%# which cells have y-th year
yuidx = cellfun(@(x) find(ismember(x,yu{y})), Year, 'UniformOutput',0);
yidx = find(cellfun(@(x) ~isempty(x), yuidx, 'UniformOutput',1));
if numel(yidx) <= 1
continue
end
%# find indices for y-th year in each cell
yidx2 = cell2mat(yuidx(yidx));
%# fill matrix to calculate correlation
ydata = zeros(26,numel(yidx));
for k = 1:numel(yidx)
ydata(:,k) = data{yidx(k)}{yidx2(k)};
end
%# calculate correlation coefficients
cc{y} = corr(ydata);
end
yu will hold the list of all years, and cc will contain the correlation matrix for each year (cc{y} stays empty for a year that occurs in only one cell). If you want, you can also keep yidx (make it a cell array and change the code accordingly).
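If the Statistics Toolbox function corr is not available, base MATLAB's corrcoef, which the question mentions, returns the same kind of pairwise matrix when it is given a matrix with one series per column. A minimal sketch with made-up data (the names ydata, R and r12 are illustrative only, not part of the answer above):

```matlab
% Three hypothetical 26-sample series from the same year,
% one column per location (random numbers, as in the question).
rng(0);                       % fixed seed so the run is repeatable
ydata = rand(26, 3);

% corrcoef treats each column as one variable and returns a
% symmetric 3x3 matrix with ones on the diagonal.
R = corrcoef(ydata);

% Correlation between location 1 and location 2:
r12 = R(1, 2);

% The two-vector form of corrcoef gives the same number:
R2 = corrcoef(ydata(:,1), ydata(:,2));
r12_check = R2(1, 2);
```

So replacing corr(ydata) with corrcoef(ydata) in the loop should give the same per-year matrices, with element (i,j) being the correlation between the i-th and j-th cells that contain that year.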

Related

Extracting all minimum values from a data-set into a matrix

I'm trying to write a program which extracts the minimum temperatures from a data-set, data, for a given month, time and year.
That is, a person should be able to select a time and a start and end year and get a matrix, lowTempsOverYears, containing the lowest recorded temperatures for January to December between the two selected years at the given time.
To demonstrate what I mean, here is a brief example. Take the years 1997-2001 and a time, say 1200. This should give me a matrix containing the lowest temperatures recorded for all months between 1997 and 2001. The output should be a 5x12 matrix (1997 through 2001 is five years), with one row per year and one column per month.
You can find my program below:
function algo= getMiniserie(data, startYear, endYear, time)
YearInterval = startYear:1:endYear;
for month = 1:12
lowTempsOverYears = zeros(length(YearInterval),12);
for yearNumber = 1:length(YearInterval)
year = YearInterval(yearNumber);
p = extractperiod(data,year,month,time);
if ~isempty(p)
q = min(p);
lowTempsOverYears(yearNumber,month) = q;
end
end
algo = lowTempsOverYears;
end
end
The data variable, from which I extract my data, consists of 3 columns and 400k+ rows.
*first column denotes the date(YYYYMMDD)
*second column denotes the time
*third column denotes the temperature
And what the extractperiod function does is that it, as the name would suggest, extracts all temperatures for a given month/year/time.
When I try to call my function by:
>> getMiniserie(data, 1997, 2001, 1200)
I get https://imgur.com/a/XpfqUoh .
Any ideas to how I could improve my code to get my desired output?
My idea was to make a variable which stores all the minimum values for each iteration of month.
So I initialized lowTempsOverYears to make it (in this particular case, where the start and end years are 1997 and 2001) a 5x12 matrix. During the first month iteration it stores all the minimum temperatures for January in the first column, with the selected years represented by the rows.
Please feel free to ask if I've omitted something from my explanation, I'll happily add to the picture.
code for extractperiod
function mdata = extractperiod(data,year,month,time)
x = year*100 + month;
k = find(floor(data(:,1)/100) == x & (data(:,2) == time));
mdata = data(k,3);
end
Because the first command inside your month loop is lowTempsOverYears = zeros(length(YearInterval),12);, you're resetting lowTempsOverYears to a matrix of zeros each time through the loop. This erases the output of each previous loop. The final time through the loop, you reset all values to zero, then fill in the 12th column.
Move the line lowTempsOverYears = zeros(length(YearInterval),12); outside your month loop, as shown below.
function algo= getMiniserie(data, startYear, endYear, time)
YearInterval = startYear:1:endYear;
lowTempsOverYears = zeros(length(YearInterval),12);
for month = 1:12
for yearNumber = 1:length(YearInterval)
year = YearInterval(yearNumber);
p = extractperiod(data,year,month,time);
if ~isempty(p)
q = min(p);
lowTempsOverYears(yearNumber,month) = q;
end
end
algo = lowTempsOverYears;
end
end
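For comparison, the same table can be built without the two loops. This is only a sketch of a different technique (accumarray with @min), not the method used in the answer above; the sample rows are invented, and missing year/month combinations are filled with NaN instead of 0, which is an assumption about what is wanted:

```matlab
% Hypothetical data in the question's layout: [YYYYMMDD, time, temperature]
data = [19970105 1200 -3.0
        19970120 1200 -7.5
        19970120 1800 -9.0
        19970210 1200  1.0
        19980115 1200 -2.0];
startYear = 1997;
endYear   = 1998;
time      = 1200;

% Split YYYYMMDD into year and month
ymd   = data(:,1);
year  = floor(ymd / 10000);
month = floor(mod(ymd, 10000) / 100);

% Keep only rows at the requested time and inside the year range
keep = data(:,2) == time & year >= startYear & year <= endYear;

% Row index = year offset, column index = month
subs   = [year(keep) - startYear + 1, month(keep)];
nYears = endYear - startYear + 1;

% Minimum temperature per (year, month); NaN where there is no data
lowTemps = accumarray(subs, data(keep,3), [nYears 12], @min, NaN);
```

Here lowTemps(1,1) is the January 1997 minimum at 1200, and the 1800 reading is ignored because it does not match the requested time.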

Vectorization instead of nested for loops in matlab

I am having trouble vectorizing this for loop in matlab which is really slow.
tvec and data are N×6 and N×4 arrays respectively, and they are the inputs to the function.
% preallocate
sVec = size(tvec)
tvec_ab = zeros(sVec(1),6);
data_ab = zeros(sVec(1),4);
inc = 0;
for i = 1:12
for j = 1:31
inc = inc +1;
[I,~] = find(tvec(:,2)==i & tvec(:,3)==j,1,'first'); % i = month (column 2), j = day (column 3)
if(I > 0)
tvec_ab(inc,:) = tvec(I,:);
data_ab(inc,:) = sum(data( (tvec(:,3) == j) & (tvec(:,2)==i) ,:));
end
end
end
% set output values
tvec_a = tvec_ab(1:inc,:);
data_a = data_ab(1:inc,:);
Every row in tvec is the timestamp at which the data in the same row of the data matrix was taken. Below you can see what a row looks like:
tvec:
[year, month, day, hour, minute, second]
data:
[dataA, dataB, dataC, dataD]
In the main program we can choose to aggregate by month, day or hour.
The code above is an example of how the aggregation for the option 'DAY' could happen.
The first time stamp of the day is the time stamp we want our output tvec_a to have in the row for that day.
The data output for that day (row in this case) would then be the sum of all the data for that day. Example:
data:
[data1ADay1, data1BDay1, data1CDay1, data1DDay1;
data2ADay1, data2BDay1, data2CDay1, data2DDay1]
aggregated data:
[data1ADay1 + data2ADay1, data1BDay1 + data2BDay1, data1CDay1+ data2CDay1,
data1DDay1+data2DDay1]
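A vectorized version (not fully tested):
[x, y] = meshgrid(1:12, 1:31);
XY = [x(:) y(:)]; % every (month, day) pair, in the same order as the loops
[I, loc] = ismember(XY, tvec(:,2:3), 'rows'); % loc: a matching row of tvec (the first one in recent MATLAB releases)
tvec_ab = zeros(size(XY,1), 6);
data_ab = zeros(size(XY,1), 4);
tvec_ab(I,:) = tvec(loc(I),:);
for c = 1:4 % accumarray needs a vector of values, so sum one data column at a time
acm = accumarray(tvec(:,2:3), data(:,c), [12 31]);
data_ab(I,c) = acm(sub2ind([12 31], XY(I,1), XY(I,2)));
end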
I actually found a way to do it myself:
%J holds the index of the first row for each unique day (e.g. if there are
%several rows from January 1, the first time stamp from January 1 will be
%the time stamp for our output)
[~,J,K] = unique(tvec(:,2:3),'rows');
%preallocate
tvec_ab = zeros(length(J),6);
data_ab = zeros(length(J),4);
tvec_ab = tvec(J,:);
%sum all data from the same days together column wise.
for i = 1:4
data_ab(:,i) = accumarray(K,data(:,i));
end
%set output
data_a = data_ab;
tvec_a = tvec_ab;
Thanks for your responses though
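A tiny worked example may make the unique/accumarray idea easier to follow. It is only a sketch with invented numbers; it asks unique for the 'first' occurrence explicitly, because older MATLAB releases return the last one by default, and it replaces the per-column loop with a single accumarray call on (group, column) subscripts, which is a variation and not the code above:

```matlab
% Hypothetical input: two samples on January 1 and one on January 2
tvec = [2013 1 1  0 0 0
        2013 1 1 12 0 0
        2013 1 2  6 0 0];
data = [ 1  2  3  4
        10 20 30 40
         5  6  7  8];

% J: first row of each unique (month, day); K: group number of every row
[~, J, K] = unique(tvec(:,2:3), 'rows', 'first');

% Build (group, column) subscripts for every element of data
nGroups = numel(J);
nCols   = size(data, 2);
[rowIdx, colIdx] = ndgrid(K, 1:nCols);

% Sum all elements that share a group and a column
data_a = accumarray([rowIdx(:) colIdx(:)], data(:), [nGroups nCols]);
tvec_a = tvec(J, :);
```

With this input, the two January 1 rows are added together and the January 2 row is kept unchanged. As in the original code, grouping on month and day alone merges the same calendar day from different years.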

MATLAB: Dividing a year-length varying-resolution time vector into months

I have a time series in the following format:
time data value
733408.33 x1
733409.21 x2
733409.56 x3
etc..
The data runs from approximately 01-Jan-2008 to 31-Dec-2010.
I want to separate the data into columns of monthly length.
For example, the first column (January 2008) will consist of the corresponding data values:
(first 01-Jan-2008 data value):(data value immediately preceding the first 01-Feb-2008 value)
Then the second column (February 2008):
(first 01-Feb-2008 data value):(data value immediately preceding the first 01-Mar-2008 value)
et cetera...
Some ideas I've been thinking of but don't know how to put together:
1. Convert all serial time numbers (e.g. 733408.33) to character strings with datestr
2. Use strmatch('01-January-2008',DatesInChars) to find the indices of the rows corresponding to 01-January-2008
3. Tricky part (?): TransformedData(:,i) = OriginalData(start:end) ? end = strmatch(1) - 1 and start = 1. Then change start at the end of the loop to strmatch(1) and then run step 2 again to find the next "starting index" and change end to the "new" strmatch(1)-1 ?
Having it speed optimized would be nice; I am going to apply it on data sampled ~2 million times.
Thanks!
I would use histc with a list of month edges (for example, the first day of each month) as the second parameter. Note: call histc with two return values, so that you also get the bin index of every sample.
The edge list can easily be created with datenum or datevec.
This way you avoid string operations, so it should be fast.
EDIT:
Example with result in a simple data structure (including some code from @Rody):
% Generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = tspan(:) + randn(size(tspan(:))); % add some noise so it's non-uniform
data = randn(size(tspan));
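% Generate list of edges (first day of each month, plus one closing edge)
edge = [];
for y = 2008:2010
for m = 1:12
edge = [edge datenum(y, m, 1)];
end
end
edge = [edge datenum(2011, 1, 1)]; % closing edge so December 2010 gets a full bin
% Histogram
[number, bin] = histc(tspan, edge);
% Setup of result (one cell per month; the last edge only closes the final bin)
result = {};
for n = 1:length(edge)-1
result{n} = [tspan(bin == n), data(bin == n)];
end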
% Test
% 04-Aug-2008 17:25:20
datestr(result{8}(4,1))
tspan(data == result{8}(4,2))
datestr(tspan(data == result{8}(4,2)))
Assuming you have sorted, non-equally-spaced date numbers, the way to go here is to put the relevant data in a cell array, so that each entry corresponds to the next month, and can hold a different amount of elements.
Here's how to do that quite efficiently:
% generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = sort(tspan(:) + randn(size(tspan(:)))); % add some noise so it's non-uniform, then sort again
data = randn(size(tspan));
% find month numbers
[~,M] = datevec(tspan);
% find indices where the month changes (the trailing 0 adds an end marker at numel(M)+1)
inds = find(diff([0; M; 0]));
% extract data in columns
sz = numel(inds)-1; % number of month blocks
cols = cell(sz,1);
for ii = 1:sz
cols{ii} = data( inds(ii) : inds(ii+1)-1 );
end
Note that it can be difficult to determine which entry in cols belongs to which month, year, so here's how to do it in a more human-readable way:
% change this line:
[y,M] = datevec(tspan);
% and change these lines:
cols = cell(sz,3);
for ii = 1:sz
cols{ii,1} = data( inds(ii) : inds(ii+1)-1 );
% also store the year and month
cols{ii,2} = y(inds(ii));
cols{ii,3} = M(inds(ii));
end
I'll assume timeVals is an Nx1 double vector holding the time value of each datum, and that data is also an Nx1 array. I also assume data and timeVals are sorted by time: that is, the samples are ordered according to the time they were taken.
How about:
subs = @(x,i) x(:,i);
months = subs( datevec(timeVals), 2 ); % extract the month of year as a number from the time
r = find( months ~= [months(2:end); months(end)+1] ); % index of the last sample of each month
monthOfCell = months( r );
r( 2:end ) = r( 2:end ) - r( 1:end-1 ); % turn end indices into the number of samples per month
dataByMonth = mat2cell( data, r ); % data is Nx1, so r gives the row counts
timeByMonth = mat2cell( timeVals, r );
After running this code, you have a cell array dataByMonth in which each cell contains all the data for one specific month. The corresponding cell of timeByMonth holds the sampling times of that month's data. Finally, monthOfCell gives the month number (1-12) of each cell.
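A compact variation on the same grouping idea, shown only as a sketch with invented sample values: it builds one integer key per calendar month (year*12 + month, a choice made here, not something from the answers above) and lets accumarray collect the samples of each month into a cell. The key keeps January 2008 and January 2009 apart.

```matlab
% Hypothetical sorted sample times (serial date numbers) and values
timeVals = [datenum(2008,1,5); datenum(2008,1,20); ...
            datenum(2008,2,3); datenum(2009,1,7)];
data = [1; 2; 3; 4];

% Year and month of every sample
[Y, M] = datevec(timeVals);

% One integer key per calendar month, then a group number per sample
key = Y*12 + M;
[uniqueKeys, ~, g] = unique(key);

% Collect the samples of each group into its own cell
dataByMonth = accumarray(g, data, [], @(x) {x});
timeByMonth = accumarray(g, timeVals, [], @(x) {x});
```

Because the cells can hold different numbers of samples, this gives one cell per month instead of equal-length columns. The order of the samples inside a cell is only reliable when the input is already sorted by time, as assumed here.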

For command + interpolation: need some tips

I have a matrix A with three columns: daily dates, prices, and hours, all vectors of the same size. There are multiple prices associated with different hours in a day.
sample data below:
A_dates =    A_hours =   A_prices =
[20080902    [ 9.698     [24.09
 20080902      9.891      24.59
 20080902     10.251      24.60
 20080903      9.584      25.63
 20080903     10.45       24.96
 20080903     12.12       24.78
 20080904     12.95       26.98
 20080904     13.569      26.78
 20080904]    14.589]     25.41]
Keep in mind that I have about two years of daily data with about 10 000 prices per day, covering almost every minute of the day from 9:30 to 16:00. My initial dataset's time was in milliseconds, which I converted to hours. Some hours, like 14.589, are repeated three times with 3 different prices. Hence I did the following:
time=[A_dates,A_hours,A_prices];
[timeinhr,price]=consolidator(time,A_prices,'mean');
where timeinhr holds both A_dates and A_hours, to take an average price at each repeated time (say, 14.589 hours).
Then, for any missing quarter hours (.25, .50, .75 and whole hours), I wish to interpolate.
For each date the hours repeat, and I need to linearly interpolate the prices that I don't have for some "wanted" hours. But of course I can't use interp1 directly if the hours repeat in my column because I have multiple days. So say:
%# here I want hours in 0.25unit increments (like 9.5hrs)
new_timeinhr = 0:0.25:max(A_hours);
day_hour = rem(new_timeinhr, 24);
%# Here I want only prices between 9.5hours and 16hours
new_timeinhr( day_hour <= 9.2 | day_hour >= 16.1 ) = [];
I then create a vector of unique days and want to use a for loop with an if statement to interpolate day by day, stacking the new prices in a vector one after the other:
days = unique(A_dates);
for j = 1:length(days);
if A_dates == days(j)
int_prices(j) = interp1(A_hours, A_prices, new_timeinhr);
end;
end;
My error is:
In an assignment A(I) = B, the number of elements in B and I must be the same.
How can I write the int_prices(j) to the stack?
I recommend converting your input to a single monotonic time value. Use the MATLAB datenum format, which represents one day as 1. There are plenty of advantages to this: You get the builtin MATLAB time/date functions, you get plot labels formatted nicely as date/time via datetick, and interpolation just works. Without test data, I can't test this code, but here's the general idea.
Based on your new information that dates are stored as 20080902 (I assume yyyymmdd), I've updated the initial conversion code. Also, since the layout of A is causing confusion, I'm going to refer to the columns of A as the vectors A_prices, A_hours, and A_dates.
% This datenum vector matches A. I'm assuming they're already sorted by date and time
At = datenum(num2str(A_dates), 'yyyymmdd') + datenum(0, 0, 0, A_hours, 0, 0);
incr = datenum(0, 0, 0, 0.25, 0, 0); % 0.25 hour
t = (At(1):incr:At(end)).'; % Full timespan of dataset, in 0.25 hour increments
frac_hours = 24*(t - floor(t)); % Fractional hours into the day
t_business_day = t((frac_hours > 9.4) & (frac_hours < 16.1)); % Time vector only where you want it
P = interp1(At, A_prices, t_business_day);
I repeat, since there's no test data, I can't test the code. I highly recommend testing the date conversion code by using datestr to convert back from the datenum to readable dates.
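To check the conversion on a few rows before running it on the full dataset, something like the following sketch can be used. The three sample rows and the query time tq are invented for illustration, and the hours are added as a fraction of a day (A_hours/24), which is a shorthand for the datenum(0, 0, 0, A_hours, 0, 0) term above:

```matlab
% Hypothetical rows: dates as yyyymmdd numbers, fractional hours, prices
A_dates  = [20080902; 20080902; 20080903];
A_hours  = [9.5; 10.25; 9.75];
A_prices = [24.0; 25.0; 26.0];

% Serial date number = whole day + fraction of a day
At = datenum(num2str(A_dates), 'yyyymmdd') + A_hours/24;

% Convert back to readable text to verify the conversion by eye
readable = datestr(At);

% Interpolate the price at 10:00 on 2 September 2008
tq = datenum(2008, 9, 2, 10, 0, 0);
P  = interp1(At, A_prices, tq);
```

Since 10:00 lies two thirds of the way from 9:30 to 10:15, P should come out two thirds of the way from 24.0 to 25.0. Note that interp1 needs strictly increasing sample times, so repeated timestamps have to be averaged first.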
Converting days/hours to serial date numbers, as suggested by @Peter, is definitely the way to go. Based on his code (which I already upvoted), I present below a simple example.
First I start by creating some fake data resembling what you described (with some missing parts as well):
%# three days in increments of 1 hour
dt = datenum(num2str((0:23)','2012-06-01 %02d:00'), 'yyyy-mm-dd HH:MM'); %#'
dt = [dt; dt+1; dt+2];
%# price data corresponding to each hour
p = cumsum(rand(size(dt))-0.5);
%# show plot
plot(dt, p, '.-'), datetick('x')
grid on, xlabel('Date/Time'), ylabel('Prices')
%# lets remove some rows as missing
idx = ( rand(size(dt)) < 0.1 );
hold on, plot(dt(idx), p(idx), 'ro'), hold off
legend({'prices','missing'})
dt(idx) = [];
p(idx) = [];
%# matrix same as yours: days,prices,hours
ymd = str2double( cellstr(datestr(dt,'yyyymmdd')) );
hr = str2double( cellstr(datestr(dt,'HH')) );
A = [ymd p hr];
%# let clear all variables except the data matrix A
clearvars -except A
Next we interpolate the price data across the entire range in 15 minutes increments:
%# convert days/hours to serial date number
dt = datenum(num2str(A(:,[1 3]),'%d %d'), 'yyyymmdd HH');
%# create a vector of 15 min increments
t_15min = (0:0.25:(24-0.25))'; %#'
tt = datenum(0,0,0, t_15min,0,0);
%# offset serial date across all days
ymd = datenum(num2str(unique(A(:,1))), 'yyyymmdd');
tt = bsxfun(@plus, ymd', tt); %#'
tt = tt(:);
%# interpolate data at new datetimes
pp = interp1(dt, A(:,2), tt);
%# extract desired period of time from each day
idx = (9.5 <= t_15min & t_15min <= 16);
idx2 = bsxfun(@plus, find(idx), (0:numel(ymd)-1)*numel(t_15min));
P = pp(idx2(:));
%# plot interpolated data, and show extracted periods
figure, plot(tt, pp, '.-'), datetick('x'), hold on
plot([tt(idx2);nan(1,numel(ymd))], [pp(idx2);nan(1,numel(ymd))], 'r.-')
hold off, grid on, xlabel('Date/Time'), ylabel('Prices')
legend({'interpolated prices','period of 9:30 - 16:00'})
and here are the two plots showing the original and interpolated data:
I think I might have solved it this way:
new_timeinhr = 0:0.25:max(A(:,2));
day_hour = rem(new_timeinhr, 24);
new_timeinhr( day_hour <= 9.4 | day_hour >= 16.1 ) = [];
days=unique(A(:,1));
P=[];
for j=1:length(days);
condition=A(:,1)==days(j);
intprices = interp1(A(condition,2), A(condition,3), new_timeinhr);
P=vertcat(P,intprices');
end;

Bucketing Algorithm

I've got some code that works, but is a bit of a bottleneck, and I'm stuck trying to figure out how to speed it up. It's in a loop, and I can't figure how to vectorize it.
I've got a 2D array, vals, that represents timeseries data. Rows are dates, columns are different series. I'm trying to bucket the data by months to perform various operations on it (sum, mean, etc). Here is my current code:
allDts; %Dates/times for vals. Size is [size(vals, 1), 1]
vals;
[Y M] = datevec(allDts);
fomDates = unique(datenum(Y, M, 1)); %first of the month dates
[Y M] = datevec(fomDates);
nextFomDates = datenum(Y, M, DateUtil.monthLength(Y, M)+1);
newVals = nan(length(fomDates), size(vals, 2)); %preallocate for speed
for k = 1:length(fomDates);
% This next line is the bottleneck because it is evaluated on every pass through the loop
idx = (allDts >= fomDates(k)) & (allDts < nextFomDates(k));
bucketed = vals(idx, :);
newVals(k, :) = nansum(bucketed);
end %for
Any Ideas? Thanks in advance.
That's a difficult problem to vectorize. I can suggest a way to do it using CELLFUN, but I can't guarantee that it will be faster for your problem (you would have to time it yourself on the specific data sets you are using). As discussed in this other SO question, vectorizing doesn't always work faster than for loops. It can be very problem-specific which is the best option. With that disclaimer, I'll suggest two solutions for you to try: a CELLFUN version and a modification of your for-loop version that may run faster.
CELLFUN SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart,'last'); % Get unique start dates and the last row of each
valCell = mat2cell(vals(sortIndex,:),diff([0; uniqueIndex(:)]));
newVals = cellfun(@(v) nansum(v,1),valCell,'UniformOutput',false); % sum down the rows, even for a single-row month
The call to MAT2CELL groups the rows of vals that have the same start date together into cells of a cell array valCell. The variable newVals will be a cell array of length numel(uniqueStarts), where each cell will contain the result of performing nansum on the corresponding cell of valCell.
FOR-LOOP SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart,'last'); % Get unique start dates and the last row of each
vals = vals(sortIndex,:); % Sort the values according to start date
nMonths = numel(uniqueStarts);
uniqueIndex = [0; uniqueIndex(:)];
newVals = nan(nMonths,size(vals,2)); % Preallocate
for iMonth = 1:nMonths,
index = (uniqueIndex(iMonth)+1):uniqueIndex(iMonth+1);
newVals(iMonth,:) = nansum(vals(index,:),1); % sum down the rows, even for a single-row month
end
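Another option worth timing is accumarray, which does the grouping without sorting vals or building index ranges. This is a sketch on invented data, not a benchmarked solution; it replaces NaN by 0 before summing to imitate nansum, which is an assumption about how missing values should be treated:

```matlab
% Hypothetical dates and a two-series value matrix with some NaNs
allDts = [datenum(2010,1,3); datenum(2010,1,20); ...
          datenum(2010,2,1); datenum(2010,2,15)];
vals = [  1  10
        NaN  20
          3  30
          4 NaN];

% First-of-month date of every row, and a group number per row
[Y, M] = datevec(allDts);
[fomDates, ~, g] = unique(datenum(Y, M, 1));

% Treat NaN as 0 so that a plain sum behaves like nansum
v = vals;
v(isnan(v)) = 0;

% Sum each series within each month
nMonths = numel(fomDates);
newVals = zeros(nMonths, size(vals, 2));
for c = 1:size(vals, 2)
    newVals(:, c) = accumarray(g, v(:, c), [nMonths 1]);
end
```

The remaining loop runs over the series (columns) instead of over the months, so the per-month logical indexing that was the bottleneck in the question is avoided. For a mean instead of a sum, the same call could be given @mean as its fourth argument, although NaN values would then need separate handling.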
If all you need to do is form the sum or mean on rows of a matrix, where the rows are summed depending upon another variable (date) then use my consolidator function. It is designed to do exactly this operation, reducing data based on the values of an indicator series. (Actually, consolidator can also work on n-d data, and with a tolerance, but all you need to do is pass it the month and year information.)
Find consolidator on the file exchange on Matlab Central