I've been stuck on a MATLAB coding problem where I needed to create market weights for many stocks from a large data file with multiple days and portfolios.
I received help from an expert the other day, who used nested loops. It worked, but I don't understand what he did in the final line. Could anyone shed some light on it and explain that last line of code?
xp = x; % x holds the market values
dates = unique(x(:,1)); % finds the unique dates in the data set (dates are in column 1)
for i = 1:size(dates,1) % loops over the unique dates
    for j = 5:size(xp,2)
        xp(xp(:,1)==dates(i),j) = xp(xp(:,1)==dates(i),j)./sum(xp(xp(:,1)==dates(i),j)); % help???
    end
end
Any comments are much appreciated!
To understand the code, you have to understand the colon operator, logical indexing and the difference between / and ./. If any of these is unclear, please look it up in the documentation.
The following code does the same, but is easier to read because I separated each step into a single line:
dates = unique(x(:,1));
% iterate over all dates; size(dates,1) returns the number of dates
for i = 1:size(dates,1)
    % iterate over the fifth to the last column, which contain the data to be normalised
    for j = 5:size(xp,2)
        % mdate is a logical vector that selects the rows with the currently processed date
        mdate = (xp(:,1) == dates(i));
        % s is the sum of all values in column j that share this date
        s = sum(xp(mdate,j));
        % divide all values in column j with this date by s, so that they sum to 1
        xp(mdate,j) = xp(mdate,j)./s;
    end
end
With this code, I suggest using the debugger and stepping through it line by line.
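To see the logical indexing in isolation, here is a minimal sketch on made-up data. The two-column layout (date in column 1, market value in column 2) and the numbers are assumptions for illustration only, not the asker's real file.

```matlab
% Hypothetical data: column 1 = date, column 2 = market value
xp = [20200101 10;
      20200101 30;
      20200102  5;
      20200102 15];

% Logical vector: true for the rows that belong to the first date
mdate = (xp(:,1) == 20200101);

% Element-wise division of the selected values by their sum,
% so the weights for that date add up to 1
xp(mdate,2) = xp(mdate,2) ./ sum(xp(mdate,2));
```

After this runs, only the two rows for the first date have changed; the rows for the second date keep their original values until the loop reaches that date.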
I had previously written some code to split 3 columns into 4, but it was very inefficient and time-consuming. As I am working with millions of rows, it wasn't suitable. (Below is my previous code.)
tline = fgetl(fid);
ID=tline(1:4);
IDN = str2double(ID);
Day=tline(6:8);
DayN = str2double(Day);
HalfHour=tline(9:10);
HalfHourN = str2double(HalfHour);
Usage=tline(12:end);
UsageN = str2double(Usage);
There must be a more efficient and quicker way of doing this?
Going back to basics, I have produced an x-by-3 matrix but require an x-by-4 matrix.
To show what I am trying to do, examining one row -
I am trying to change
1001 36501 1005
to
1001 365 01 1005
Any help would be much appreciated!
Edit:
The second column I am trying to divide into two, is always composed of 5 characters. I am trying to get the first 3 characters into their own column, likewise for the remaining characters.
What might take time in your case is actually the use of the str2double function. It is known that this built-in function becomes very slow when the data set is large. You might try to get rid of it if possible.
You can use modulo:
day = (36501 - mod(36501,100))/100
This gives you 365.
If you want the 01, it is mod(36501,100), which returns 1.
So this effectively splits your second column into two different numbers, which you can then rename as needed.
On second thought, if all the numbers in your second column have 5 digits, this can be extremely efficient, since MATLAB computes mod as b = a - m.*floor(a./m);
Check http://uk.mathworks.com/help/matlab/ref/mod.html; it should work for vectors (i.e. your second column).
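Applied to a whole column at once, the modulo idea might look like the sketch below. The matrix A and its values are made up for illustration, and it assumes the data is already numeric (so a value such as 00148 is stored as 148).

```matlab
% Hypothetical x-by-3 numeric matrix: [ID, day+half-hour code, usage]
A = [1001 36501 1005;
     1002   148  870];

% Last two digits of the second column (the half-hour slot)
halfHour = mod(A(:,2), 100);

% Remaining leading digits (the day)
day = (A(:,2) - halfHour) / 100;

% Assemble the x-by-4 result: [ID, day, half-hour, usage]
B = [A(:,1), day, halfHour, A(:,3)];
```

Because everything operates on whole columns, no loop and no str2double call are needed.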
I'm new to Matlab and this seems to be beyond me. I appreciate the help; thanks in advance.
Basically, I have a multi-column dataset with column headers. The number of columns can vary from dataset to dataset.
I need to iterate through all the combinations of columns (e.g. A+B, A+C, ..., B+C, B+D, etc.) and run a formula on each pair (in this instance a correlation formula, but it could be another formula later).
If a particular combination returns "true", the column headers of that pair should be returned.
Would appreciate if you could point me in the right direction.
Thanks in advance.
Use nchoosek to get all pairs of columns:
pairs_columns = nchoosek(1:m, 2);
pairs = {};
for pair = 1:size(pairs_columns,1)
    flag = your_correlation_test(data(:,pairs_columns(pair,1)), data(:,pairs_columns(pair,2)));
    if flag
        pairs{end+1,1} = data_header(pairs_columns(pair,1));
        % no end+1 needed here: the previous line has already added the new row
        pairs{end,2} = data_header(pairs_columns(pair,2));
    end
end
m is your number of columns
your_correlation_test is your test function that returns a Boolean result
data is your dataset (which I'm assuming you can index by column number?)
data_header is a placeholder for whatever the correct way is to get the header from your dataset based on the column number. Sorry, I'm not very familiar with datasets in Matlab.
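As a concrete sketch of the loop above, here is one way it could look with a plain numeric matrix and a cell array of header names. The data, the headers, and the 0.9 threshold are all assumptions for illustration; corrcoef stands in for whatever test you actually need.

```matlab
% Hypothetical data: column B is an exact multiple of A, C is unrelated
data    = [1 2 5; 2 4 1; 3 6 4; 4 8 2; 5 10 3];
headers = {'A', 'B', 'C'};
threshold = 0.9;                       % assumed cut-off for "true"

pairIdx = nchoosek(1:size(data,2), 2); % every pair of column indices
pairs   = cell(0, 2);                  % header pairs that pass the test

for k = 1:size(pairIdx,1)
    a = pairIdx(k,1);
    b = pairIdx(k,2);
    R = corrcoef(data(:,a), data(:,b)); % 2-by-2 correlation matrix
    if abs(R(1,2)) > threshold
        pairs(end+1,:) = {headers{a}, headers{b}};
    end
end
```

With this sample data only the A/B pair exceeds the threshold, so pairs ends up with a single row.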
I'm reading in a csv file that is about 80MB - data_O3. It's about 250,000 x 5 in size. I created E, which is a little bit larger because it has all the days (data_O3 is missing some days). I want to compare the two so that if the date (saved in variable d3) and siteID (d4) are the same, the data point (column 5) is placed in E.
for j = 1:size(data_O3,1)
E(strcmp(d3,data_O3{j,3})&d4 == data_O3{j,4},5) = data_O3(j,5);
end
This script works fine, but for some reason, running it takes longer than expected. I've run the same code for other data that were only slightly smaller with no problem. Is this an issue with the strcmp code or something else?
The script and files used can be found here: https://www.dropbox.com/sh/7bzq3m1ixfeuhu6/i4oOvxHPkn
There are certainly a number of ways to speed this up significantly.
First of all, read all numeric data in as numbers. Matlab is not optimized to work with strings, and even cells should generally be avoided as much as possible. If you want to keep everything as strings, use another language (Python or Perl).
Once you have the state, county and site read in as numbers, then create a number instead of a string for the siteID. One approach would be to use the formula:
siteID = siteNum + 1e4*countyCode + 1e7*stateCode
That would generate unique siteIDs for all sites.
Use datenum to convert the date field into a number.
You are now in a position where the data_O3 defined on line 79 can be a purely numeric array (no cells!), as can your E matrix. That alone will make the process many times faster.
You also might want to define the E as something other than NaN. Maybe give it values of -1.
There may be more optimizations you can do in the comparison, but do the above first and I expect you will see a huge improvement.
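A rough sketch of the steps above, on made-up values. The variable names, the date format, and the three-column layout [date, siteID, value] are assumptions rather than what the linked script uses, and the final ismember(...,'rows') match is one possible way to replace the strcmp loop, not something taken from the original script.

```matlab
% Hypothetical sample values standing in for columns read from the CSV file
stateCode  = [6; 6; 36];
countyCode = [37; 37; 61];
siteNum    = [1002; 2005; 10];
dateStr    = {'2010-01-01'; '2010-01-03'; '2010-01-01'};  % assumed date format
ozone      = [0.031; 0.042; 0.027];

% Numeric site ID, using the formula given above
siteID = siteNum + 1e4*countyCode + 1e7*stateCode;

% Dates as serial date numbers
dateNum = datenum(dateStr, 'yyyy-mm-dd');

% Purely numeric version of the data: [date, siteID, value]
data = [dateNum, siteID, ozone];

% E lists every day for every site, pre-filled with -1 (assumed layout)
allDays = (datenum(2010,1,1):datenum(2010,1,3))';
sites   = unique(siteID);
[D, S]  = ndgrid(allDays, sites);
E = [D(:), S(:), -ones(numel(D), 1)];

% Match all rows at once on (date, siteID) instead of looping with strcmp
[found, loc] = ismember(data(:,1:2), E(:,1:2), 'rows');
E(loc(found), 3) = data(found, 3);
```

Rows of E that have no matching measurement keep the -1 placeholder.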
I have results which are 6 columns long, but they have been printed as 2, then 3 beneath, then 1 beneath that! There are hundreds of lines, and Matlab will not accept the structure of the matrix as it is now. Is there any way to tell Matlab I want the first 5 results in their own columns, then continuing down the rows after that?
My results appear as follows:
0.5 0
0.59095535915335684063 -0.59095535915335395405 -5.89791913085569763
33e-08
... repeated a lot
Thanks so much, Em xx
I would just do a format shortE before you process the output, this will give you everything in scientific notation with 4 digits after the decimal. That 'should' allow you to fit your columns all in one line, so you don't have to deal with the botched output.
In general you should not need the output to be in too specific a format, but suppose you have this matrix:
M =[0.5 0 0.59095535915335684063 -0.59095535915335395405 -5.89791913085569763 33e-08];
To make it an actual matrix I will repeat it a bit:
M = repmat(M,10,1);
Now you can ensure that all six columns will fit on a normal screen by using the format.
format short
Try help format to find more options. Now simply showing the matrix will put all columns next to each other. If you want one column below, the trick is to reduce your window's width until it can only hold five columns. Matlab will then print the last column below the first five.
M % Simply show the matrix
% Now reduce your window size
M % Simply show it again
This should help you display the numbers in Matlab. If you want to process them further, you can consider writing them to a file instead. Try help xlswrite for a simple solution.
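If the goal is simply to get every row printed on a single line regardless of the window width, one alternative is fprintf with an explicit format. This is a sketch of a different technique from the format command discussed above; the three-row matrix and the %.6g precision are arbitrary choices for illustration.

```matlab
% Hypothetical matrix: the six example values repeated over three rows
row = [0.5 0 0.59095535915335684 -0.59095535915335395 -5.89791913085569763 33e-08];
M = repmat(row, 3, 1);

% One conversion per column, one newline per row
fmt = '%.6g %.6g %.6g %.6g %.6g %.6g\n';

% fprintf consumes its data column by column, so pass the transpose
% to have each row of M printed on its own line
fprintf(fmt, M.');
```

The same format string can be passed to fprintf with a file identifier from fopen if the output should go to a text file instead of the command window.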
Basically I have a <96x659 double> matrix and I want to extract the 1st and 2nd columns, then the 8th and 9th, then the 15th and 16th, and so on.
So I want each 2 columns in a step of 6. I hope I was clear enough. I'm a newbie in Matlab.
Thanks in advance!
All you really need to do is construct the list of columns you want:
columns = [1:7:size(matrix,2), 2:7:size(matrix,2)];
submat = matrix(:, columns);
Keep in mind this will not necessarily return the columns in the order you want. If you want the columns in ascending order, you could substitute
submat = matrix(:, sort(columns));
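A quick check on a small made-up matrix shows which columns get picked. The 2-by-16 test matrix, whose entries equal their own column index, is an assumption for illustration only.

```matlab
% Hypothetical 2-by-16 matrix whose entries equal their column index
matrix = repmat(1:16, 2, 1);
n = size(matrix, 2);

% Columns 1, 8, 15, ... and 2, 9, 16, ..., sorted into ascending order.
% Both ranges stop at n, so no index can run past the last column.
columns = sort([1:7:n, 2:7:n]);

submat = matrix(:, columns);
```

Here columns comes out as 1, 2, 8, 9, 15, 16. For the 96x659 matrix in the question, the last selected column would be 659 on its own, since column 660 does not exist.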
Matlab Summary and Tutorial
This is a pretty decent introduction, if the Matlab documentation itself appears to be a little dense. Go through the examples, try them out.