Using matlab to arrange and sort data - matlab

My data is an excel file with two columns in this format:
Date Type
3/12/06 A
3/12/06 B
3/12/06 B
3/12/06 C
6/01/07 A
6/01/07 A
8/01/07 B
...
Column A are dates and can be repeated while column B are types of observations on these dates.
In MATLAB I want to plot each type as a function of time, however first I need to arrange my data. There are often multiple identical rows that correspond to multiple observations of the same type on the same date. So I think first I need to count how many times a certain type occurred on the same day?
Any help would be great! I'm still at the stage of trying to read the dates in the correct format...

Here is a solution: I replace each type and each date with a specific index and then I use accumarray in order to create a 2D pivot table. You can also directly use the function pivot table from excel.
% We load the xls file.
[~,txt] = xlsread('test.xls');
% We delete the header:
txt(1,:) = [];
% Value and index for the date:
[val_d,~,ind_d] = unique(txt(:,1));
% Value and index for the type:
[val_c,~,ind_c] = unique(txt(:,2));
% We use accumarray to create a pivot table that count each occurence.
acc = accumarray([ind_d,ind_c],1)
% Then we simply plot the result:
dateFormat = 'dd/mm/yy';
for i = 1:length(val_c);
subplot(1,length(val_c),i)
bar(datenum(val_d,dateFormat),acc(:,i),1) % easier to deal with datenum
datetick('x',dateFormat)
xlabel('Date')
ylabel([val_c{i},' count'])
ylim([0,3])
end
RESULT:

Related

MATLAB Table Datetime loop indexing

I am trying to take my excel spreadsheet and import it into MATLAB (already accomplished that), and then using for-loop indexing to create arrays of the data for a give day containing.
So ideally I would like to know how I could iterate through a years worth of data, and create variables with the date corresponding to the table elements on that day. As I said, I have multiple years worth of data, which is why I'd like a solution which would "automate" my process.
Welcome to stackoverflow! Note that it is easier if you provide some code -- at least to create some sample data.
Anyway, you can loop over dates easily and there shouldn't be a problem with efficiency if you scale it to many entries:
Tm = [ datetime('now') + duration(1,0,0)%add 1 hour
datetime('now')
datetime('yesterday')
datetime('tomorrow')];
% convert to date
Dt = yyyymmdd(Tm);
% you may want to sort it
% [val,idx] = sort(Dt);
% get unique dates
Dt_uq = unique(Dt);
% create a cell of storage
DataAtDate = cell(length(Dt_uq),1);
% loop over unique dates
for i = 1:length(Dt_uq)
Dt1 = Dt_uq(i);
% get all of the same type
lg = Dt == Dt1;
% index matrix/cell to do something
disp(Dt(lg,:))
% do something with the data or matrix or table... e.g. store it in a cell
DataAtDate{i} = Dt(lg,:);
end

Importing Excel Data with Dates and Values and Plotting Time Series in Matlab

I would like to plot a time series in Matlab of a data set I have in Excel.
The Excel file looks as follows:
Data: | Value:
2005-04-01 | 5.20
2006-12-02 | 3.12
...
How could I load this into Matlab and plot the time series of it?
There's 2 easy way of plotting dates, but I'll give you the script to read from the xls file first.
% Read from Excel
[N,T] = xlsread( filepath );
You then need to extract/convert the dates as follows. Dates are the 1st column of the text.
d = datetime( T(:,1) );
Then you can plot the variables as follows
figure;
plot( d, N(:,1) );
A sample plot is here
Alternatively, you can use datenum instead of datetime if you want the date as an integer instead of a datetime object using the following line.
d = datenum( T(:,1) );
use xlsread to load in data as string, now converting the date and values can be done in a number of ways, the most strait forward way for your dates is perhaps using the str2num which allows you to read in the N th character as a number, for example:
string = "2005-04-01|5.20"
year = str2num(string(1:4))
month = str2num(string(6:7)) //%where the number is the N th character in string, and has to be numerical or the str2num will return error
//%Note that this method does not read non-numerical strings, in your case "-" "|" will not be read.
I believe this is one of the faster methods, since it mainly involve a type conversion on a standard format of data. Any other suggestions?
PS: in your time series plot, I would suggest converting your dates into a long number, i.e. 20050401, 20061202, etc; which would be most efficient. (you can do this by years*1000+months*100+days.

Matlab - Access index of max value in for loop and use it to remove values from array

I would like to recursively find the maximum value in a series of matrices (column 8, to be specific), then use the index of that maximum value to set all values in the array with index up to the max index to NaN (for columns 14:16). It is straight forward to find the max value and index, but using a for loop to do it for multiple arrays I am stumped.
Here is how I can do it without a for loop:
[C,Max] = max(wy2000(:,8));
wy2000(1:Max,14:16) = NaN;
[C,Max] = max(wy2001(:,8));
wy2001(1:Max,14:16) = NaN;
[C,Max] = max(wy2002(:,8));
wy2002(1:Max,14:16) = NaN;
and so on and so forth...
Here are two ways I have tried using a for loop:
startyear = 2000;
endyear = 2009;
for n=startyear:endyear
currentYear = sprintf('wy%d',n);
[C,Max] = max(currentYear(:,8));
currentYear(1:Max,14:16) = NaN;
end
Here is another way I tried, using the eval function
for n=2000:2009;
currentYear = ['wy' int2str(n)];
var2 = ['maxswe' int2str(n)];
eval([var2 ' = max(currentYear(:,8))']);
end
In both cases, the problem seems to be that MATLAB doesn't recognize the 'currentYear' variable to be the array that corresponds to the wyXXXX that I already have created in my workspace.
Based on Peters answer, here is some more info about my data. I am starting with a matrix of data called all_data which holds 16 columns of data, spanning the time period 1982 - 2012. I am only interested in the period 2000 - 2009, and I am also interested in analyzing each year individually (2000, 2001,...,2009).
To get the data into individual years, I use the following code:
for n=2000:2009;
s = datenum(n-1,10,1);
e = datenum(n,9,30);
startcell = find(TIME(:,7)==s);
endcell = find(TIME(:,7)==e);
var1 = ['wy' int2str(n)];
eval([var1 '= all_data3(startcell:endcell,:)']);
eval(['save ', var1]);
end
For clarification, it is the period 10/1/YEAR1 to 9/30/YEAR2 that I am interested in, and TIME is a matrix holding the dates and times of my data.
So at the end of the above for-loop, I have a new matrix for each water-year (wy). I then want to find the date of maximum snow-accumulation (column 8) and exclude all data prior to that date from my analysis. this is where the original question comes from.
Peter's solution works, but I was hoping to find a more simple solution to find the max date and set the values prior to that date to NaN, without having to declare a bunch of variables (or entries in a cell array).
If I could write a loop that would create the cell array that Peter suggested based on a start and end year, that would make the code transferable to other datasets, but when i try to do this I run into the issue that the index for the cell-array is 1:length(years), but the wy arrays are named according to the actual year, so there is an inconsistency when using the eval function.
Matt
You've discovered the problem with eval and dynamically named variables. They're messy. I'd recommend recoding this as a cell array, with the cell array index being the index for the year:
years = 2000:2009;
wy{1} = wy2000;
wy{2} = wy2001;
% etc...
% Then,
for n=1:length(years)
[C, maxval] = max(wy{n}(:,8));
% etc.
end
You really only need the actual year when you input the data and when you display it. Now, if you're starting from a huge pile of arrays already named this way, that's the time to use eval: to convert them into this form that's easier to use. Just form the eval strings so they read, for example, 'wy{1} = wy2000;'

Run time series analysis for multiple series Matlab

I wish to apply the same data analysis to multiple data time series. However the number of data series is variable. So instead of hard-coding each series' analysis I would like to be able to specify the number and name of the funds and then have the same data-manipulation done to all before they are combined into a single portfolio.
Specifically I have an exel file where each worksheet is a time series where the first column is dates and the second column is prices. The dates for all funds may not correspond so the individual worksheets must be sifted for dates that occur in all the funds before combining into one data set where there is one column of dates and all other columns correspond to the data of each of the present funds.
This combined data set is then analysed for means and variances etc.
I currently have worked out how to carry out the merging and the analysis (below) but I would like to know how I can simply add or remove funds (i.e. by including new worksheets containing individual funds data in the excel file) without having to re-write and add/ remove extra/excess matlab code.
*% LOAD DATA*
XL='XLData.xlsx';
formatIn = 'dd/mm/yyyy';
formatOut = 'mmm-dd-yyyy';
*%SPECIFY WORKSHEETS*
Fund1Prices=3;
Fund2Prices=4;
*%RETRIEVE VALUES*
[Fund1values, ~, Fund1sheet] = xlsread(XL,Fund1Prices);
[Fund2values, ~, Fund2sheet] = xlsread(XL,Fund2Prices);
*%EXTRACT DATES AND DATA AND COMBINE (TO REMOVE UNNECCESSARY TEXT IN ROWS 1
%TO 4) FOR FUND 1.*
Fund1_dates_data=Fund1sheet(4:end,1:2);
Fund1Dates= cellstr(datestr(datevec(Fund1_dates_data(:,1),formatIn),formatOut));
Fund1Data= cell2mat(Fund1_dates_data(:,2));
*%EXTRACT DATES AND DATA AND COMBINE (TO REMOVE UNNECCESSARY TEXT IN ROWS 1
%TO 4) FOR FUND 2.*
Fund2_dates_data=Fund2sheet(4:end,1:2);
Fund2Dates= cellstr(datestr(datevec(Fund2_dates_data(:,1),formatIn),formatOut));
Fund2Data= cell2mat(Fund2_dates_data(:,2));
*%CREATE TIME SERIES FOR EACH FUND*
Fund1ts=fints(Fund1Dates,Fund1Data,'Fund1');
Fund2ts=fints(Fund2Dates,Fund2Data,'Fund2');
*%CREATE PORTFOLIO*
Port=merge(Fund1ts,Fund2ts,'DateSetMethod','Intersection');
*%ANALYSE PORTFOLIO*
Returns=tick2ret(Port);
q = Portfolio;
q = q.estimateAssetMoments(Port)
[qassetmean, qassetcovar] = q.getAssetMoments
Based on edit to the question, the answer was rewritten
You can put your code into a function. This function can be saved as an .m-file and called from Matlab.
However, you want to replace the calls to specific worksheets (Fund1Prices=3) with an automated way of figuring out how many worksheets there are. Here's one way of how to do that in a function:
function [Returns,q,qassetmean,qassetcovar] = my_data_series_analysis(XL)
% All input this function requires is a variable
% containing the name of the xls-file you want to process
formatIn = 'dd/mm/yyyy';
formatOut = 'mmm-dd-yyyy';
% Determine the number of worksheets in the xls-file:
[~,my_sheets] = xlsfinfo(XL);
% Loop through the number of sheets
% (change the start value if the first sheets do not contain data):
% this is needed to merge your portfolio
% in case you do not start the for-loop at I=1
merge_count = 1;
for I=1:size(my_sheets,2)
% RETRIEVE VALUES
% note that Fund1Prices has been replaced with the loop-iterable, I
[FundValues, ~, FundSheet] = xlsread(XL,I);
% EXTRACT DATES AND DATA AND COMBINE
% (TO REMOVE UNNECCESSARY TEXT IN ROWS 1 TO 4)
Fund_dates_data = FundSheet(4:end,1:2);
FundDates = cellstr(datestr(datevec(Fund_dates_data(:,1),...
formatIn),formatOut));
FundData = cell2mat(Fund_dates_data(:,2));
% CREATE TIME SERIES FOR EACH FUND
Fundts{I}=fints(FundDates,FundData,['Fund',num2str(I)]);
if merge_count == 2
Port = merge(Fundts{I-1},Fundts{I},'DateSetMethod','Intersection');
end
if merge_count > 2
Port = merge(Port,Fundts{I},'DateSetMethod','Intersection');
end
merge_count = merge_count + 1;
end
% ANALYSE PORTFOLIO
Returns=tick2ret(Port);
q = Portfolio;
q = q.estimateAssetMoments(Port)
[qassetmean, qassetcovar] = q.getAssetMoments
This function will return the Returns, q, qassetmean and qassetcovar variables for all the worksheets in the xls-file you want to process. The variable XL should be specified like this:
XL = 'my_file.xls';
You can also loop over more than one xls-file. Like this:
% use a cell so that the file names can be of different length:
XL = {'my_file.xls'; 'my_file2.xls'}
for F=1:size(XL,1)
[Returns{F},q{F},qassetmean{F},qassetcovar{F}] = my_data_series_analysis(XL{F,1});
end
Make sure to store the values which are returned from the function in cells (as shown) or structs (not shown) to account for the fact that there may be a different number of sheets per file.

Bucketing Algorithm

I've got some code that works, but is a bit of a bottleneck, and I'm stuck trying to figure out how to speed it up. It's in a loop, and I can't figure how to vectorize it.
I've got a 2D array, vals, that represents timeseries data. Rows are dates, columns are different series. I'm trying to bucket the data by months to perform various operations on it (sum, mean, etc). Here is my current code:
allDts; %Dates/times for vals. Size is [size(vals, 1), 1]
vals;
[Y M] = datevec(allDts);
fomDates = unique(datenum(Y, M, 1)); %first of the month dates
[Y M] = datevec(fomDates);
nextFomDates = datenum(Y, M, DateUtil.monthLength(Y, M)+1);
newVals = nan(length(fomDates), size(vals, 2)); %preallocate for speed
for k = 1:length(fomDates);
This next line is the bottleneck because I call it so many times.(looping)
idx = (allDts >= fomDates(k)) & (allDts < nextFomDates(k));
bucketed = vals(idx, :);
newVals(k, :) = nansum(bucketed);
end %for
Any Ideas? Thanks in advance.
That's a difficult problem to vectorize. I can suggest a way to do it using CELLFUN, but I can't guarantee that it will be faster for your problem (you would have to time it yourself on the specific data sets you are using). As discussed in this other SO question, vectorizing doesn't always work faster than for loops. It can be very problem-specific which is the best option. With that disclaimer, I'll suggest two solutions for you to try: a CELLFUN version and a modification of your for-loop version that may run faster.
CELLFUN SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart); % Get unique start dates
valCell = mat2cell(vals(sortIndex,:),diff([0 uniqueIndex]));
newVals = cellfun(#nansum,valCell,'UniformOutput',false);
The call to MAT2CELL groups the rows of vals that have the same start date together into cells of a cell array valCell. The variable newVals will be a cell array of length numel(uniqueStarts), where each cell will contain the result of performing nansum on the corresponding cell of valCell.
FOR-LOOP SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart); % Get unique start dates
vals = vals(sortIndex,:); % Sort the values according to start date
nMonths = numel(uniqueStarts);
uniqueIndex = [0 uniqueIndex];
newVals = nan(nMonths,size(vals,2)); % Preallocate
for iMonth = 1:nMonths,
index = (uniqueIndex(iMonth)+1):uniqueIndex(iMonth+1);
newVals(iMonth,:) = nansum(vals(index,:));
end
If all you need to do is form the sum or mean on rows of a matrix, where the rows are summed depending upon another variable (date) then use my consolidator function. It is designed to do exactly this operation, reducing data based on the values of an indicator series. (Actually, consolidator can also work on n-d data, and with a tolerance, but all you need to do is pass it the month and year information.)
Find consolidator on the file exchange on Matlab Central