Find rows within multiple arrays that have the same header and remove all other rows, using Matlab - matlab

In Matlab, I have several txt files that I have loaded and converted to matrices. The matrices represent temperature data at different cities around the world. The first column in each matrix is a year. Each file spans a different range of years, but they all overlap for a few of those years. I would like to find where they overlap and either extract the overlapping years or delete the non-overlapping ones, so that when I plot the data, each data set uses the same span of years. The code should be able to ingest an unknown number of these txt files. I have tried to use the "intersect" function, but that works on an element-by-element basis. I want all data for the overlapping years, so the elements (except for the header, i.e. the year) will be different.
An example of current code:
clear all
files = dir('*.txt');
num_files = length(files);
mintersect(files);
for i=1:num_files
eval(['load ' files(i).name ' -ascii']);
vals{i} = load(files(i).name);
matrix = vals{i};
station = (files(i).name(1:end-4));
matrix(matrix == 999.9) = NaN;
matrix(matrix == -99.0) = NaN;
years = matrix(:,1);
months = matrix(:,2:13)';
figure, hold on
plot(years, months,'');
ylabel('Temp.');
xlabel('Years');
grid on;
title(sprintf('Mean Monthly Temperature for %s Station',station));
end
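One way to get the common span of years, sketched under assumptions: the matrices are already collected in a cell array (called vals, as in the loop above), and the hard-coded data here is a hypothetical stand-in for the loaded files. intersect is applied only to the year columns, and ismember then selects the matching rows in each matrix.

```matlab
% Hypothetical stand-ins for the loaded matrices: column 1 is the year,
% columns 2:13 are monthly values.
vals = {[(1950:1990)', rand(41,12)], ...
        [(1960:2000)', rand(41,12)], ...
        [(1955:1985)', rand(31,12)]};

% Start with the years of the first file and narrow them down file by file.
common = vals{1}(:,1);
for k = 2:numel(vals)
    common = intersect(common, vals{k}(:,1));
end

% Keep only the rows whose year is in the common set.
for k = 1:numel(vals)
    keep = ismember(vals{k}(:,1), common);
    vals{k} = vals{k}(keep,:);
end
```

Because the loop runs over numel(vals), the same code handles any number of files. Plotting would then happen in a second loop, after all files have been trimmed.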

Related

Matlab interp1 gives last row as NaN

I have a problem similar to here. However, it doesn't seem that there is a resolution.
My problem is as follows: I need to import some files, for example, 5. There are 20 columns in each file, but the number of lines varies. Column 1 is time in crank-angle degrees, and the rest are data.
So my code first imports all of the files, finds the file with the largest number of rows, then creates a multidimensional array with that many rows. The timing is in engine cycles, so I then remove lines from the imported file that go beyond a whole engine cycle. This way, I always have data in terms of X whole engine cycles. Then I interpolate the data onto the pre-allocated array to get one giant multi-dimensional array for the 5 data files.
However, this always seems to result in the last row of every column of every page being filled with NaNs. Please have a look at the code below. I can't see what I'm doing wrong. Oh, and by the way, as I have been screwed over before, this is NOT homework.
maxlines = 0;
maxcycle = 999;
for i = 1:1
filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
file = filelines(filename); % Import file clean
lines = size(file,1); % Find number of lines of file
if lines > maxlines
maxlines = lines; % If number of lines in this file is the most, save it
end
lastCAD = file(end,1); % Add simstart to shift the start of the cycle to 0 CAD
lastcycle = fix((lastCAD-simstart)./cycle); % Find number of whole engine cycles
if lastcycle < maxcycle
maxcycle = lastcycle; % Find lowest number of whole engine cycles amongst all designs
end
cols = size(file,2); % Find number of columns in files
end
lastcycleCAD = maxcycle.*cycle+simstart; % Define last CAD of whole cycle that can be used for analysis
% Import files
thermo = zeros(maxlines,cols,designs); % Initialize array to proper size
xq = linspace(simstart,lastcycleCAD,maxlines); % Define the CAD degrees
for i = 1:designs
filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
file = importthermo(filename, 6, inf); % Import the file clean
[~,lastcycleindex] = min(abs(file(:,1)-lastcycleCAD)); % Find index of end of last whole cycle
file = file(1:lastcycleindex,:); % Remove all CAD after that
thermo(:,1,i) = xq;
for j = 2:17
thermo(:,j,i) = interp1(file(:,1),file(:,j),xq);
end
sprintf('file from folder %s imported OK',folder{i})
end
thermo(end,:,:) = []; % Remove NaN row
Thank you very much for your help!
Are you sampling out of range? If so, you need to tell interp1 that you want extrapolation:
interp1(file(:,1),file(:,j),xq,'linear','extrap');
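A small illustration of the difference, using made-up sample points rather than the data from the question. By default interp1 returns NaN for any query point outside the range of the sample points; with 'extrap' it extends the linear fit instead. In the question, one possible cause (an assumption, not confirmed by the asker) is that the last value of xq lies slightly beyond the last retained crank angle.

```matlab
x  = [0 1 2 3];
y  = [0 10 20 30];
xq = [0 1.5 3 3.5];   % the last query point lies outside [0, 3]

yDefault = interp1(x, y, xq);                      % out-of-range point becomes NaN
yExtrap  = interp1(x, y, xq, 'linear', 'extrap');  % out-of-range point is extrapolated

disp(yDefault)   % 0  15  30  NaN
disp(yExtrap)    % 0  15  30  35
```

Extrapolation hides the symptom; if the overshoot is unintended, it may be cleaner to build xq so that it ends at the last sample actually kept.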

Only Import File when it contains certain numbers from a Table

I have a couple of hundred sensor measurement files, all containing the date and time of measurement. All the files have names that include date and time. Example:
07-06-2016_17-58-32.wf
07-06-2016_18-02-32.wf
...
...
08-06-2016_17-48-26.wf
I have a function (importfile) and a loop that imports my data. The loop looks like this:
Files = dir('C:\Osci\User\*.waveform');
numFiles = length(Files);
Data = cell(1, numFiles);
for fileNum = 1:numFiles
Data{fileNum} = importfile(Files(fileNum).name);
end
Not all of these waveform files are useful. The measurement files are only useful if they were generated in a certain time period. I have a table that shows my allowed time periods:
07-Jun-2016 18:00:01
07-Jun-2016 18:01:31
07-Jun-2016 18:02:01
...
I want to modify my loop, so that the files (.waveform files) are only imported if the numbers for day (first number), hour (4th number) and minute (5th number) from the files match the numbers of the table containing the allowed time periods.
EDIT: Rather than a scalar hour, minute, and second, there is a vector of each. In my case, MyDay, MyHour and MyMinute are 1100x1 matrices while fileTimes only consists of 361 rows.
So, using the provided example the loop should only import file
07-06-2016_18-02-32.wf
since it is the only one where the numbers match (in this case 7, 18, 02).
EDIT2: Using @erfan's answer (and changing some directories and variable names) I have the following working code:
fmtstr = 'O:\\Basic_Research_All\\Lange\\Skripe ISAT\\Rohdaten\\*_%02i-*-*_%02i-%02i-*.wf';
Files = struct([]);
n = size(MyDayMyHourMyMinute, 1);
for N = 1:n
Files = [Files; dir(sprintf(fmtstr, MyDayMyHourMyMinute(N,:)))];
end
numFiles = length(Files);
WaveformData = cell(1, numFiles);
for fileNum = 1:numFiles
WaveformData{fileNum} = importfile(Files(fileNum).name);
end
Since your filenames are pretty well defined as dates and times, you can prefilter your list by turning them into actual dates and times:
% Get the file list
Files = dir('C:\Osci\User\*.waveform');
% You only need the names
Files = {Files.name};
% Get just the filename w/o the extension
[~, baseFileNames] = cellfun(@(x) fileparts(x), Files, 'UniformOutput', false);
% Your filename is just a date (day-month-year), so parse it as such
fileTimes = datevec(baseFileNames, 'dd-mm-yyyy_HH-MM-SS');
% Now pick out the files you want
% goodFiles = fileTimes(:, 4) == myHour & fileTimes(:, 5) == myMinute & fileTimes(:, 6) == mySecond;
goodFiles = ismember(fileTimes(:, 4:6), [myHour(:), myMinute(:), mySecond(:)], 'rows');
% Pare down your list of filenames
Files = Files(goodFiles);
% Preallocate your data cell
Data = cell(1, numel(Files));
% Now do your loop
for idx = 1:numel(Data)
Data{idx} = importfile(Files{idx});
end
You will, of course, need to define myHour, myMinute and mySecond. Of course, using the logical indexing in goodFiles, you could impose any sort of time criteria, like time or date range. If you find that your filenames aren't so well defined, you could parse out the filename using textscan or strfind to get the bits you want. The important thing is that cell arrays can be indexed into in much the same way as numerical or string arrays and it's often better to vectorize your filter criteria and then only do the loop on the parts you have to.
The OP indicated in a comment below that rather than a scalar hour, minute, and second, there is a vector of each. In that case, use ismember to match the two sets of times and return a logical index vector. With R2015a, MathWorks introduced the function ismembertol, which allows one to check membership within a certain tolerance.
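A minimal sketch of the row-wise match, adapted to the day, hour and minute criteria from the question. The fileTimes rows are typed in by hand to mirror two of the example filenames, and allowed is a hypothetical three-column matrix built from the table of allowed times.

```matlab
% Date vectors [year month day hour minute second] for two example files.
fileTimes = [2016 6 7 17 58 32;
             2016 6 7 18  2 32];

% Hypothetical allowed [day hour minute] combinations from the table.
allowed = [7 18 0;
           7 18 1;
           7 18 2];

% A file is kept only if its whole [day hour minute] row appears in allowed.
goodFiles = ismember(fileTimes(:, 3:5), allowed, 'rows');
disp(goodFiles')   % 0  1
```

The 'rows' option matters here: without it, ismember would test each number on its own, and a file could pass because its hour matches one row and its minute another.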
You can apply your selection from the beginning. Imagine the acceptable values for day, hour and minute are saved in acc as an n*3 matrix. If you replace the first line of your code with:
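% Backslashes are doubled because sprintf treats a single backslash as an escape
fmtstr = 'C:\\Osci\\User\\%02i-*-*_%02i-%02i-*.wf';
Files = struct([]);
n = size(acc, 1); % number of allowed [day hour minute] rows
for ii = 1:n
Files = [Files; dir(sprintf(fmtstr, acc(ii,:)))];
end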
Then you have already applied your criteria to Files. The rest is the same.
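To see what dir is actually asked for, it can help to print the pattern that sprintf produces for one row of acc. This sketch uses a single hypothetical row and leaves out the folder prefix.

```matlab
acc = [7 18 2];   % hypothetical allowed [day hour minute]

% Each %02i consumes one element of the row and zero-pads it to two digits.
pattern = sprintf('%02i-*-*_%02i-%02i-*.wf', acc(1,:));
disp(pattern)     % 07-*-*_18-02-*.wf
```

The wildcards stand in for the month, year and seconds, so only the day, hour and minute have to match.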

Matlab - Access index of max value in for loop and use it to remove values from array

I would like to recursively find the maximum value in a series of matrices (column 8, to be specific), then use the index of that maximum value to set all values in the array with index up to the max index to NaN (for columns 14:16). It is straightforward to find the max value and index, but I am stumped on how to do it for multiple arrays with a for loop.
Here is how I can do it without a for loop:
[C,Max] = max(wy2000(:,8));
wy2000(1:Max,14:16) = NaN;
[C,Max] = max(wy2001(:,8));
wy2001(1:Max,14:16) = NaN;
[C,Max] = max(wy2002(:,8));
wy2002(1:Max,14:16) = NaN;
and so on and so forth...
Here are two ways I have tried using a for loop:
startyear = 2000;
endyear = 2009;
for n=startyear:endyear
currentYear = sprintf('wy%d',n);
[C,Max] = max(currentYear(:,8));
currentYear(1:Max,14:16) = NaN;
end
Here is another way I tried, using the eval function
for n=2000:2009;
currentYear = ['wy' int2str(n)];
var2 = ['maxswe' int2str(n)];
eval([var2 ' = max(currentYear(:,8))']);
end
In both cases, the problem seems to be that MATLAB doesn't recognize the 'currentYear' variable to be the array that corresponds to the wyXXXX that I already have created in my workspace.
Based on Peter's answer, here is some more info about my data. I am starting with a matrix of data called all_data which holds 16 columns of data, spanning the time period 1982 - 2012. I am only interested in the period 2000 - 2009, and I am also interested in analyzing each year individually (2000, 2001,...,2009).
To get the data into individual years, I use the following code:
for n=2000:2009;
s = datenum(n-1,10,1);
e = datenum(n,9,30);
startcell = find(TIME(:,7)==s);
endcell = find(TIME(:,7)==e);
var1 = ['wy' int2str(n)];
eval([var1 '= all_data3(startcell:endcell,:)']);
eval(['save ', var1]);
end
For clarification, it is the period 10/1/YEAR1 to 9/30/YEAR2 that I am interested in, and TIME is a matrix holding the dates and times of my data.
So at the end of the above for-loop, I have a new matrix for each water-year (wy). I then want to find the date of maximum snow accumulation (column 8) and exclude all data prior to that date from my analysis. This is where the original question comes from.
Peter's solution works, but I was hoping to find a simpler way to find the max date and set the values prior to that date to NaN, without having to declare a bunch of variables (or entries in a cell array).
If I could write a loop that creates the cell array Peter suggested from a start and end year, that would make the code transferable to other datasets. But when I try to do this, I run into the issue that the index for the cell array is 1:length(years), while the wy arrays are named according to the actual year, so there is an inconsistency when using the eval function.
Matt
You've discovered the problem with eval and dynamically named variables. They're messy. I'd recommend recoding this as a cell array, with the cell array index being the index for the year:
years = 2000:2009;
wy{1} = wy2000;
wy{2} = wy2001;
% etc...
% Then,
for n=1:length(years)
[C, maxval] = max(wy{n}(:,8));
% etc.
end
You really only need the actual year when you input the data and when you display it. Now, if you're starting from a huge pile of arrays already named this way, that's the time to use eval: to convert them into this form that's easier to use. Just form the eval strings so they read, for example, 'wy{1} = wy2000;'
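A sketch of the cell-array version of the whole loop, under assumptions: the random matrices are hypothetical stand-ins for the water-year data, and in practice each wy{n} would be filled from the real data at that point. The vector years maps the cell index back to the actual year, so no variable names have to be built with eval.

```matlab
years = 2000:2009;
wy = cell(1, numel(years));

% Hypothetical stand-in data: 365 rows and 16 columns per water year.
for n = 1:numel(years)
    wy{n} = rand(365, 16);
end

% Find the row of the maximum in column 8 and blank out columns 14:16 up to it.
for n = 1:numel(years)
    [~, idxMax] = max(wy{n}(:, 8));
    wy{n}(1:idxMax, 14:16) = NaN;
end

% Look up a specific year through the years vector instead of a variable name.
wy2003 = wy{years == 2003};
```

Changing the first line to another range of years is then the only edit needed to reuse the code on a different dataset.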

Run time series analysis for multiple series Matlab

I wish to apply the same data analysis to multiple data time series. However the number of data series is variable. So instead of hard-coding each series' analysis I would like to be able to specify the number and name of the funds and then have the same data-manipulation done to all before they are combined into a single portfolio.
Specifically, I have an Excel file where each worksheet is a time series in which the first column is dates and the second column is prices. The dates for all funds may not correspond, so the individual worksheets must be sifted for dates that occur in all the funds before combining them into one data set with one column of dates and one further column for each of the funds present.
This combined data set is then analysed for means and variances etc.
I have worked out how to carry out the merging and the analysis (below), but I would like to know how I can simply add or remove funds (i.e. by including new worksheets containing individual funds' data in the Excel file) without having to rewrite, add or remove Matlab code.
% LOAD DATA
XL='XLData.xlsx';
formatIn = 'dd/mm/yyyy';
formatOut = 'mmm-dd-yyyy';
%SPECIFY WORKSHEETS
Fund1Prices=3;
Fund2Prices=4;
%RETRIEVE VALUES
[Fund1values, ~, Fund1sheet] = xlsread(XL,Fund1Prices);
[Fund2values, ~, Fund2sheet] = xlsread(XL,Fund2Prices);
%EXTRACT DATES AND DATA AND COMBINE (TO REMOVE UNNECESSARY TEXT IN ROWS 1
%TO 4) FOR FUND 1.
Fund1_dates_data=Fund1sheet(4:end,1:2);
Fund1Dates= cellstr(datestr(datevec(Fund1_dates_data(:,1),formatIn),formatOut));
Fund1Data= cell2mat(Fund1_dates_data(:,2));
%EXTRACT DATES AND DATA AND COMBINE (TO REMOVE UNNECESSARY TEXT IN ROWS 1
%TO 4) FOR FUND 2.
Fund2_dates_data=Fund2sheet(4:end,1:2);
Fund2Dates= cellstr(datestr(datevec(Fund2_dates_data(:,1),formatIn),formatOut));
Fund2Data= cell2mat(Fund2_dates_data(:,2));
%CREATE TIME SERIES FOR EACH FUND
Fund1ts=fints(Fund1Dates,Fund1Data,'Fund1');
Fund2ts=fints(Fund2Dates,Fund2Data,'Fund2');
%CREATE PORTFOLIO
Port=merge(Fund1ts,Fund2ts,'DateSetMethod','Intersection');
%ANALYSE PORTFOLIO
Returns=tick2ret(Port);
q = Portfolio;
q = q.estimateAssetMoments(Port)
[qassetmean, qassetcovar] = q.getAssetMoments
Based on edit to the question, the answer was rewritten
You can put your code into a function. This function can be saved as an .m-file and called from Matlab.
However, you want to replace the calls to specific worksheets (Fund1Prices=3) with an automated way of figuring out how many worksheets there are. Here's one way of how to do that in a function:
function [Returns,q,qassetmean,qassetcovar] = my_data_series_analysis(XL)
% All input this function requires is a variable
% containing the name of the xls-file you want to process
formatIn = 'dd/mm/yyyy';
formatOut = 'mmm-dd-yyyy';
% Determine the number of worksheets in the xls-file:
[~,my_sheets] = xlsfinfo(XL);
% Loop through the number of sheets
% (change the start value if the first sheets do not contain data):
% this is needed to merge your portfolio
% in case you do not start the for-loop at I=1
merge_count = 1;
for I=1:size(my_sheets,2)
% RETRIEVE VALUES
% note that Fund1Prices has been replaced with the loop-iterable, I
[FundValues, ~, FundSheet] = xlsread(XL,I);
% EXTRACT DATES AND DATA AND COMBINE
% (TO REMOVE UNNECESSARY TEXT IN ROWS 1 TO 4)
Fund_dates_data = FundSheet(4:end,1:2);
FundDates = cellstr(datestr(datevec(Fund_dates_data(:,1),...
formatIn),formatOut));
FundData = cell2mat(Fund_dates_data(:,2));
% CREATE TIME SERIES FOR EACH FUND
Fundts{I}=fints(FundDates,FundData,['Fund',num2str(I)]);
if merge_count == 2
Port = merge(Fundts{I-1},Fundts{I},'DateSetMethod','Intersection');
end
if merge_count > 2
Port = merge(Port,Fundts{I},'DateSetMethod','Intersection');
end
merge_count = merge_count + 1;
end
% ANALYSE PORTFOLIO
Returns=tick2ret(Port);
q = Portfolio;
q = q.estimateAssetMoments(Port)
[qassetmean, qassetcovar] = q.getAssetMoments
This function will return the Returns, q, qassetmean and qassetcovar variables for all the worksheets in the xls-file you want to process. The variable XL should be specified like this:
XL = 'my_file.xls';
You can also loop over more than one xls-file. Like this:
% use a cell so that the file names can be of different length:
XL = {'my_file.xls'; 'my_file2.xls'}
for F=1:size(XL,1)
[Returns{F},q{F},qassetmean{F},qassetcovar{F}] = my_data_series_analysis(XL{F,1});
end
Make sure to store the values which are returned from the function in cells (as shown) or structs (not shown) to account for the fact that there may be a different number of sheets per file.

matlab averages of cell array

I have the following script for importing text files into matlab which include hourly data, where I am then trying to convert them into daily averages:
clear all
pathName = ...
TopFolder = pathName;
dirListing = dir(fullfile(TopFolder,'*.txt'));%Lists the folders in the directory specified by pathName.
for i = 1:length(dirListing);
SubFolder{i} = dirListing(i,1).name;%obtain the name of each folder in
%the specified path.
end
%import data
for i=1:length(SubFolder);
rawData1{i} = importdata(fullfile(pathName,SubFolder{i}));
end
%convert into daily averages
rawData2=cell2mat(rawData1);
%create one matrix for entire data set
altered=reshape(rawData2,24,(size(rawData2,2)*365));
%convert into daily values
altered=mean(altered)';
%take the average for each day
altered=reshape(altered,365,size(rawData2,2));
%convert back into original format
My problem lies in trying to convert the data back into the same format as 'rawData1', which was a cell for each variable (where each variable is denoted by 'SubFolder'). The reason for doing this is that all but one of the variables are vectors, while the remaining variable is a matrix (8760*11).
So, an example of this would be:
clear all
cell_1 = rand(8760,1);
cell_2 = rand(8760,1);
cell_3 = rand(8760,1);
cell_4 = rand(8760,1);
cell_5 = rand(8760,1);
cell_6 = rand(8760,11);
cell_7 = rand(8760,1);
cell_8 = rand(8760,1);
cell_9 = rand(8760,1);
data = {cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9};
Where I need to convert each cell in 'data' from hourly values into daily averages (i.e. 365 rows).
Any advice would be much appreciated.
I think this does what you want.
data = cellfun(@(x) reshape(mean(reshape(x,24,[]))',365,[]),data,'uniformoutput',false);
However, that is kind of confusing, so I will explain a little.
The part mean(reshape(x,24,[]))' inside the cellfun reshapes each cell in data into a 24 by 365 matrix, computes the mean of each column, then turns the result back into a single column. This works fine when the original data only has 1 column, but for cell_6 with 11 columns it puts all the data end to end. So I added an additional reshape(...) wrapper around the mean(...) part to put it back into the original 11 columns, or more precisely N columns that are 365 rows in length.
Note: This is going to give you errors if you ever have data sets whose dimensions are not 8760 by X.
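A quick way to convince yourself that the one-liner keeps the columns apart is to run it on a small cell array and compare one daily value against a direct average. The random data below is a hypothetical stand-in for the imported files.

```matlab
% One vector and one 11-column matrix of hourly values (8760 = 24 * 365).
data = {rand(8760,1), rand(8760,11)};

daily = cellfun(@(x) reshape(mean(reshape(x,24,[]))',365,[]), ...
                data, 'UniformOutput', false);

disp(size(daily{1}))   % 365   1
disp(size(daily{2}))   % 365  11

% Cross-check: day 1 of column 3 should equal the mean of its first 24 hours.
expected = mean(data{2}(1:24, 3));
assert(abs(daily{2}(1,3) - expected) < 1e-12)
```

The check works because reshape fills column by column, so each block of 24 rows within one original column becomes its own column before the mean is taken.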