Converting Data to Timeseries - MATLAB

I have Excel data in the following format
Ticker Date Price
GOOG 1/1/12 100
GOOG 1/2/12 200
AAPL 1/1/12 50
etc.
I would like to convert this to a time series collection (or just a matrix of data) in the following format:
Date GOOG AAPL .... (variable number of tickers)
1/1/12 100 50
This format would be much easier to work with in MATLAB for the calculations I need to do.
The way I've done this in the past, and I don't believe it is the most efficient, was to run unique(tickers) to find how many tickers there are, then chop up the data accordingly in a for loop. I think this is very inefficient (and ugly) for larger data sets. Does anyone have a better suggestion?
Here's a sample of a previous attempt on similar data, which assumes the data are sorted by ticker:
[uniqueSecurities, uniqueIndex] = unique(Tickers);
numberSecurities = length(uniqueSecurities);
The above tells you where each new ticker starts (at every uniqueIndex entry).
Now, assuming there is the same number of observations for each ticker, you can chop up the data in this manner:
numberObservations = ...;  % however many rows each ticker has
j = 1;
for secIndex = 1:numberSecurities
    NewDataMatrix(:,secIndex) = Prices(j : j + numberObservations - 1);
    j = j + numberObservations;
end
Now, if you have a variable number of observations for each security, instead of jumping in numberObservations-sized steps you use the uniqueIndex defined above and, in the same manner, chop out everything with indices between uniqueIndex(k) and uniqueIndex(k+1).
The reason I'm posting is that I don't believe I'm being very efficient, and besides: is there some built-in MATLAB way of doing this? As I understand it, most databases will give me data in the above format (not the best of formats!) and unfortunately I don't have any control over that.
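For reference, here is a minimal vectorized sketch of the wide-format reshaping itself, assuming Tickers is a cell array of ticker strings and Dates and Prices are vectors of the same length (the names are taken from the question; the approach tolerates a variable number of observations per ticker):
[uDates,   ~, dateIdx] = unique(Dates);    % row index for each observation
[uTickers, ~, tickIdx] = unique(Tickers);  % column index for each observation
wide = nan(numel(uDates), numel(uTickers));           % dates-by-tickers matrix
wide(sub2ind(size(wide), dateIdx, tickIdx)) = Prices; % scatter prices into place
Missing (date, ticker) pairs simply stay NaN, and no per-ticker loop or sorting is needed.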

Related

More efficient way to search a large matrix in MATLAB?

I have code that does what I want, but it is too slow because I have a very large .mat file containing a matrix (33 GB) that I need to search for particular values and extract them.
The file that I'm searching has the following structure:
reporter sector partner year ave
USA all CAN 2007 0.060026126
USA all CAN 2011 0.0637898418
...
This goes on for millions of rows. I want to extract the last (5th) column value for particular reporter and partner values (sector and year are fixed). In reality there are more fixed values, which I have removed for the sake of simplicity, but they might slow my code down even more. The country_codes and partner values need to vary and are looped over for that reason.
The crucial part of my code is the following:
for i = 1:length(country_codes)
    for g = 1:length(partner)
        matrix(i,g) = big_file( ...
            ismember(GTAP_data(:,1), country_codes(i)) & ... % reporter
            ismember(GTAP_data(:,2), 'all') & ...            % sector
            ismember(GTAP_data(:,3), partner(g)) & ...       % partner
            ismember([GTAP_data{:,4}]', 2011), ...           % year
            5);                                              % ave column
    end
end
In other words, the code goes through millions of rows and finds just the right value by applying ismember with logical & on everything.
Is there a faster way to do this than ismember? Can someone assist?
So what I see is that you build a big table out of the data in different files.
It seems your values are text-based, which takes up a lot of memory: the string 'USA' alone occupies six bytes of character data (MATLAB stores characters as two bytes each), before cell-array overhead. If you have fewer than 255 countries to consider, you could store each one as a single byte in uint8 format.
If you can store all columns as values between 0 and 255, you can build a uint8 matrix that can be indexed very fast.
As an example:
%demo
GTAP_regions={'USA','NL','USA','USA','NL','GB','NL','USA','Korea Republic of','GB','NL','USA','Korea Republic of'};
S=whos('GTAP_regions');
S.bytes
GTAP_regions requires 1580 bytes. Now we convert it.
GTAP_regions_list = GTAP_regions(1);  % lookup table: code -> country name
GTAP_regions_uint = uint8(1);         % compact coded copy of the data
for ct = 2:length(GTAP_regions)
    I = ismember(GTAP_regions_list, GTAP_regions(ct));
    if ~any(I)
        % new country: append it to the lookup table and store its new code
        GTAP_regions_list(end+1) = GTAP_regions(ct);
        GTAP_regions_uint(end+1) = uint8(length(GTAP_regions_list));
    else
        % known country: store its existing code
        GTAP_regions_uint(end+1) = uint8(find(I));
    end
end
S=whos('GTAP_regions_list');
S.bytes
S=whos('GTAP_regions_uint');
S.bytes
GTAP_regions_uint is what we use for the actual indexing; it is now only 13 bytes (one per record) and will be very fast to analyse.
GTAP_regions_list is what we use to look up which country each index value belongs to; it is only 496 bytes.
You can do the same for sector, partner and year, depending on the range of years: with no more than 255 different years it works as-is; otherwise you could store them as uint16 and have 65535 possible options.
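Incidentally, the same encoding can be produced in one vectorized call; a sketch using the third output of unique (note that unique sorts the lookup list, so the codes differ from the loop version above, but the mapping is equivalent):
[GTAP_regions_list, ~, codes] = unique(GTAP_regions);  % codes(k) indexes into the list
GTAP_regions_uint = uint8(codes);                      % safe while numel(list) <= 255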

Logically index Data based upon two datetime arrays in MATLAB

I will jump straight into a minimal example as I find this difficult to put into words. I have the following example:
Data.Startdate=[datetime(2000,1,1,0,0,0) datetime(2000,1,2,0,0,0) datetime(2000,1,3,0,0,0) datetime(2000,1,4,0,0,0)];
Data.Enddate=[datetime(2000,1,1,24,0,0) datetime(2000,1,2,24,0,0) datetime(2000,1,3,24,0,0) datetime(2000,1,4,24,0,0)];
Data.Value=[0.5 0.1 0.2 0.4];
Event_start=[datetime(2000,1,1,12,20,0) datetime(2000,1,1,16,0,0) datetime(2000,1,4,8,0,0)];
Event_end=[datetime(2000,1,1,14,20,0) datetime(2000,1,1,23,0,0) datetime(2000,1,4,16,0,0)];
What I want to do is add a flag to the Data structure (say a 1) if any time between Data.Startdate and Data.Enddate falls between Event_start and Event_end. In the example above, Data.Flag would have the values 1 0 0 1, because from the Event_start and Event_end vectors you can see there are events on January 1st and January 4th. The idea is that I will use this flag to process the data further.
I am sure this is straightforward but would appreciate any help you can give.
I would convert the dates to numbers using datenum, which then allows fairly convenient comparisons using bsxfun:
isStartBeforeEvent = bsxfun(@gt, datenum(Event_start)', datenum(Data.Startdate));
isEndAfterEvent    = bsxfun(@lt, datenum(Event_end)',   datenum(Data.Enddate));
flag = any(isStartBeforeEvent & isEndAfterEvent, 1)
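An aside: since the inputs are already datetime arrays, on releases with implicit expansion (R2016b and later) the same flag can be computed without datenum, roughly like this:
% each row compares one event against every data interval; any() collapses over events
flag = any(Event_start(:) > Data.Startdate & Event_end(:) < Data.Enddate, 1);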

MATLAB Loop Programming

I've been stuck on a MATLAB coding problem where I needed to create market weights for many stocks from a large data file with multiple days and portfolios.
I received help from an expert the other day using 'nested loops'. It worked, but I don't understand what he has done in the final line. I was wondering if anyone could shed some light and explain that last line of code.
xp = x;                  % x holds the market values
dates = unique(x(:,1));  % finds the unique dates in the data set (dates are column 1)
for i = 1:size(dates,1)  % loop over each unique date
    for j = 5:size(xp,2) % loop over the data columns
        xp(xp(:,1)==dates(i),j) = xp(xp(:,1)==dates(i),j)./sum(xp(xp(:,1)==dates(i),j)); % help???
    end
end
Any comments are much appreciated!
To understand the code, you have to understand the colon operator, logical indexing and the difference between / and ./. If any of these is unclear, please look it up in the documentation.
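As a quick toy illustration of those three pieces (not from the original code):
v = 1:4;                          % colon operator builds the row vector [1 2 3 4]
mask = (v > 2);                   % logical index: [0 0 1 1] selects the last two elements
v(mask) = v(mask)./sum(v(mask));  % ./ divides element-wise, as in the question's line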
The following code does the same, but is easier to read because I separated each step into a single line:
dates = unique(x(:,1));
% iterate over all dates; size(dates,1) returns the number of dates
for i = 1:size(dates,1)
    % iterate over the fifth through last columns, which contain the data to be normalised
    for j = 5:size(xp,2)
        % mdate is a logical vector used to select the rows with the currently processed date
        mdate = (xp(:,1)==dates(i));
        % s is the sum of all values in column j that share this date
        s = sum(xp(mdate,j));
        % divide all values in column j with this date by s, which normalises them to sum to 1
        xp(mdate,j) = xp(mdate,j)./s;
    end
end
I also suggest stepping through this code with the debugger.
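For completeness, a sketch that removes the date loop by using accumarray for the per-date sums; the grouping index g is new here (not in the original), the rest reuses the question's variables:
[~, ~, g] = unique(xp(:,1));           % g(k) is the date-group number of row k
for j = 5:size(xp,2)
    colSums = accumarray(g, xp(:,j));  % per-date sums of column j
    xp(:,j) = xp(:,j) ./ colSums(g);   % divide each row by its date's sum
end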

indexing to find corresponding number

I have a time series of measurements taken at different depths of a water column. I have divided these into individual cells (for later use) and need some help completing the following. For example:
time = [733774,733774,733775,733775,733775,733776,733776];
bthD = [20,10,0,15,10,20,10];
bthA = (1000:100:1600);
%Hypsographic
Hypso = [(10:1:20)',(1000:100:2000)'];
d = [1,1.3,1,2.5,2.5,1,1.2];
data = horzcat(time',bthD',d');
uniqueTimes = unique(time);
counts = hist(time,uniqueTimes);
newData = mat2cell(data,counts,length(uniqueTimes));
So, in newData I have three cells that correspond to different days of measurements; within each cell, column 1 is time, column 2 is depth, and column 3 is the measurement. I would like to find the area at each depth in the cells; the area at different depths is given in the variable Hypso.
How could I achieve this?
Your problem formulation is excellent! Very easy to understand what you need here. All you need is the function interp1. Use the first column of Hypso, I assume, as your depth, and the second column as the area. interp1 is vectorized, so you can find all the values in one call:
areaAtDepth = interp1(Hypso(:,1),Hypso(:,2),bthD)
areaAtDepth =
Columns 1 through 6
2000 1000 NaN 1500 1000 2000
Column 7
1000
You'll notice the NaN in the third column of the output. This is because its associated depth, 0, is outside the range (the support) of the data. You'll need to decide what you want to do when data fall outside the range; or perhaps they never should, and an error should be raised. It's up to you! Let me know if you have any more questions!
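To apply this per cell of newData, a small sketch (assuming depth is each cell's second column, as set up in the question):
areas = cellfun(@(c) interp1(Hypso(:,1), Hypso(:,2), c(:,2)), newData, ...
                'UniformOutput', false);  % one vector of areas per day-cell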

Comparing dates and filling in gap times in matlab

I have a data file which contains time data. The list is quite long, 100,000+ points. There is data every 0.1 seconds, and the time stamps look like this:
'2010-10-10 12:34:56'
'2010-10-10 12:34:56.1'
'2010-10-10 12:34:56.2'
'2010-10-10 12:34:56.3'
etc.
Not every 0.1 second interval is necessarily present. I need to check whether a 0.1 second interval is missing, then insert this missing time into the date vector. Comparing strings seems unnecessarily complicated. I tried comparing seconds since midnight:
date_nums=datevec(time_stamps);
secs_since_midnight=date_nums(:,4)*3600+date_nums(:,5)*60+date_nums(:,6);
comparison_secs=linspace(0,86400,864000);
res=(ismember(comparison_secs,secs_since_midnight)~=1);
However, this approach doesn't work due to rounding errors: the seconds-since-midnight values and the linspace vector I compare them against never quite match (due to the tenth-of-a-second resolution?). The intent is to later run an FFT on the data associated with the time stamps, so I want the data as uniform as possible (the data for the missing intervals will be interpolated). I've considered blocking it into smaller chunks of time and checking the chunks one at a time, but I don't know if that's the best way to go about it. Thanks!
Multiply your numbers-of-seconds by 10 and round to the nearest integer before comparing against your range.
There may be more efficient ways to do this than ismember. (I don't know offhand how clever the implementation of ismember is, but if it's The Simplest Thing That Could Possibly Work then you'll be taking O(N^2) time that way.) For instance, you could use the timestamps that are actually present (as integer numbers of 0.1-second intervals) as indices into an array.
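A sketch of the suggested fix, phrased as direct indexing rather than ismember (assumes secs_since_midnight from the question's code; slot k+1 represents k tenths of a second after midnight):
tenths = round(secs_since_midnight * 10);  % integer slot 0..863999 for each record
present = false(864000, 1);                % one slot per 0.1 s in the day
present(tenths + 1) = true;                % mark the records we actually have
missingSlots = find(~present) - 1;         % tenths-of-a-second slots with no record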
Since you're concerned with missing data records and not other timing issues such as a drifting time channel, you could check for missing records by converting the time values to seconds, doing a DIFF, and finding the first differences that are greater than some tolerance. This tells you the indices where the missing records should go; it's then up to you to do something about it. Remember, if you're going to use this list of indices to fill the gaps, process the list in descending index order, since inserting records would otherwise desynchronise the index list from the data.
>> time_stamps = now:.1/86400:now+1; % Generate test data.
>> time_stamps(randi(length(time_stamps), 10, 1)) = []; % Remove 10 random records.
>> t = datenum(time_stamps); % Convert to date numbers.
>> t = 86400 * t; % Convert to seconds.
>> index = find(diff(t) > 1.999 * 0.1)' + 1 % Find missing records.
index =
30855
147905
338883
566331
566557
586423
642062
654682
733641
806963
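A rough sketch of the descending-order insertion the answer describes (assuming time_stamps is a row vector of serial date numbers, and handling gaps of more than one slot):
step = 0.1/86400;                            % 0.1 s expressed in days
for k = flipud(index(:)).'                   % largest gap index first
    n = round((time_stamps(k) - time_stamps(k-1))/step) - 1;  % records missing here
    gapTimes = time_stamps(k-1) + (1:n)*step;                 % the times to insert
    time_stamps = [time_stamps(1:k-1), gapTimes, time_stamps(k:end)];
end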