Ignoring vectors containing NaN entries in Matlab calculations - matlab

This code prices bonds using the fitSvensson function. How do I get MATLAB to ignore NaN values in the CleanPrice vector when a date is selected for which some bonds have a NaN entry (a missing price)? That is, how can I get it to ignore that bond altogether when deriving the zero curve? Many solutions to NaNs resort to interpolation or setting values to zero, but that would lead to an erroneous curve.
Maturity=gcm3.data.CouponandMaturity(1:36,2);
[r,c]=find(gcm3.data.CleanPrice==datenum('11-May-2012'));
row=r
SettleDate=gcm3.data.CouponandMaturity(row,3);
Settle = repmat(SettleDate,[length(Maturity) 1]);
CleanPrices =transpose(gcm3.data.CleanPrice(row,2:end));
CouponRate = gcm3.data.CouponandMaturity(1:36,1);
Instruments = [Settle Maturity CleanPrices CouponRate];
PlottingPoints = gcm3.data.CouponandMaturity(1,2):gcm3.data.CouponandMaturity(36,2);
Yield = bndyield(CleanPrices,CouponRate,Settle,Maturity);
SvenssonModel = IRFunctionCurve.fitSvensson('Zero',SettleDate,Instruments)
ParYield=SvenssonModel.getParYields(Maturity);
The data looks like this, where each column is a bond, column 1 holds the dates, and the elements are the clean prices. As you can see, the first part of the data contains lots of NaNs for bonds yet to have prices. After a point all bonds have prices, but unfortunately there are instances where one or two days' prices are missing.
Ideally, if a NaN is present I would like it to ignore that bond on that date if possible, since the more curves generated (irrespective of the number of bonds used) the better. If this is not possible then ignoring that date is an option, but that will result in many curves not being generated.

This is a general solution to your problem. I don't have that toolbox on my work computer, so I can't test whether it works with the IRFunctionCurve.fitSvensson command.
[row,~]=find(gcm3.data.CleanPrice(:,1)==datenum('11-May-2012'));
col_set=find(~isnan(gcm3.data.CleanPrice(row,2:end)));
CleanPrices=transpose(gcm3.data.CleanPrice(row,col_set+1)); % +1 because col_set indexes columns 2:end
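Building on that, here is a sketch of the full workflow with the NaN bonds dropped. I haven't been able to test it either, and it assumes the bond order in CouponandMaturity matches the price columns 2:end of CleanPrice:
[row,~]  = find(gcm3.data.CleanPrice(:,1) == datenum('11-May-2012'));
col_set  = find(~isnan(gcm3.data.CleanPrice(row,2:end)));       % bonds priced on that date
CleanPrices = transpose(gcm3.data.CleanPrice(row, col_set+1));  % +1 skips the date column
Maturity    = gcm3.data.CouponandMaturity(col_set, 2);          % keep only the priced bonds
CouponRate  = gcm3.data.CouponandMaturity(col_set, 1);
SettleDate  = gcm3.data.CouponandMaturity(row, 3);              % as in the original code
Settle      = repmat(SettleDate, [length(Maturity) 1]);
Instruments   = [Settle Maturity CleanPrices CouponRate];
SvenssonModel = IRFunctionCurve.fitSvensson('Zero', SettleDate, Instruments);
ParYield      = SvenssonModel.getParYields(Maturity);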

Related

Datetick take into account NaN in plot

I have a series y that contains values, some of which are NaN some numeric (double).
The series has an associated vector d which contains the datenum dates.
Example:
y=[NaN(5,1); rand(10,1)]
d=now-14:now
When I run:
plot(d,y)
I get the graph I want; the NaN observations are taken out.
However, when I run:
plot(d,y); datetick
then my graph starts from the beginning and takes into account all the observations (even when y is a NaN).
How can I prevent this from happening?
From the documentation we can see that there is an easy way (shown below) to preserve the current axes limits.
plot(d,y);
datetick('keeplimits');
The 'keeplimits' argument does exactly what it suggests, maintaining the x-axis limits whilst converting the tick values to dates. You may also want to pass 'keepticks' to preserve tick mark locations.
The behaviour you describe seems contrary to the docs:
datetick selects a label format based on the minimum and maximum limits of the specified axis.
From this statement I would expect the values to remain the same, but there is obviously something about the way the limits are handled internally which means the NaN points are included. At least we are given a simple workaround!
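For completeness, a minimal sketch of the workaround applied to the example from the question:
y = [NaN(5,1); rand(10,1)];                % first five observations are missing
d = now-14:now;                            % matching datenum dates
plot(d, y);                                % plot already skips the NaN points
datetick('x', 'keeplimits', 'keepticks');  % convert ticks without widening the axis limits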

How to calculate the "rest value" of a plot?

Didn't know how to paraphrase the question well.
Here is an example function (data: https://www.dropbox.com/s/wr61qyhhf6ujvny/data.mat?dl=0).
In this case how do I calculate that the rest point of this function is ~1? I have access to the vector that makes the plot.
I guess the mean is an approximation but in some cases it can be pretty bad.
Under the assumption that the "rest" point is the steady-state value in your data, and given that the steady-state value occurs the majority of the time in your data, you can simply bin all of the points, using each unique value as a separate bin. The bin with the highest count should correspond to the steady-state value.
You can do this by a combination of histc and unique. Assuming your data is stored in y, do this:
%// Find all unique values in your data
bins = unique(y);
%// Find the total number of occurrences per unique value
counts = histc(y, bins);
%// Figure out which bin has the largest count
[~,max_bin] = max(counts);
%// Figure out the corresponding y value
ss_value = bins(max_bin);
ss_value contains the steady-state value of your data, corresponding to the most occurring output point with the assumptions I laid out above.
A minor caveat with the above approach is that it is not friendly to floating-point data: values that agree in the first few significant digits but differ further down are still treated as distinct unique values.
Here's an example of your data from point 2300 to 2320:
>> format long g;
>> y(2300:2320)
ans =
0.99995724232555
0.999957488454868
0.999957733165346
0.999957976465197
0.999958218362579
0.999958458865564
0.999958697982251
0.999958935720613
0.999959172088623
0.999959407094224
0.999959640745246
0.999959873049548
0.999960104014889
0.999960333649014
0.999960561959611
0.999960788954326
0.99996101464076
0.999961239026462
0.999961462118947
0.999961683925704
0.999961904454139
Therefore, what I'd recommend is to perhaps round so that the first 5 or so significant digits are maintained.
You can do this to your dataset before you continue:
num_digits = 5;
y_round = round(y*(10^num_digits))/(10^num_digits);
This will first multiply by 10^n where n is the number of digits you desire so that the decimal point is shifted over by n positions. We round this result, then divide by 10^n to bring it back to the scale that it was before. If you do this, for those points that were 0.9999... where there are n decimal places, these will get rounded to 1, and it may help in the above calculations.
However, more recent versions of MATLAB have this functionality built into round, and you can just do this:
num_digits = 5;
y_round = round(y,num_digits);
Minor Note
More recent versions of MATLAB discourage the use of histc and recommend you use histcounts instead. The calling syntax is very similar (histcounts interprets the bins as edges, so the counts may need a small adjustment), so just replace histc with histcounts if your MATLAB version can handle it.
Using the above logic, you could also use the median too. If the majority of data is fluctuating around 1, then the median would have a high probability that the steady-state value is chosen... so try this too:
ss_value = median(y_round);
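On a related note (not part of the original approach, just an observation): the unique/histc combination above is effectively computing the mode of the rounded data, so the whole calculation can be collapsed into a single call:
num_digits = 5;
y_round  = round(y, num_digits);   % or round(y*10^num_digits)/10^num_digits on older versions
ss_value = mode(y_round);          % most frequently occurring value = steady-state estimate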

Preserving matrix columns using Matlab brush/select data tool

I'm working with matrices in Matlab which have five columns and several million rows. I'm interested in picking particular groups of this data. Currently I'm doing this using plot3() and the brush/select data tool.
I plot the first three columns of the matrix as X,Y, Z and highlight the matrix region I'm interested in. I then use the brush/select tool's "Create variable" tool to export that region as a new matrix.
The problem is that when I do that, the remaining two columns of the original, bigger matrix are dropped. I understand why: they weren't plotted, and hence the figure tool doesn't know about them. I need all five columns of that subregion, though, in order to continue the processing pipeline.
I'm adding the appropriate 4th and 5th column values to the exported matrix using a horrible nested if-loop approach: if columns 1, 2 and 3 match in both the original and the exported matrix, attach columns 4/5 of the original matrix to the exported one. It's bad design and agonizingly slow. I know there has to be a MATLAB function/trick for this; can anyone help?
Thanks!
This might help:
1. I start with matrix 1 with columns X,Y,Z,A,B
2. Using the brush/select tool, I create a new (subregion) matrix 2 with columns X,Y,Z
3. I then loop through all members of matrix 2 against all members of matrix 1. If X,Y,Z match for a pair of rows, I append A and B from that row in matrix 1 to the appropriate row in matrix 2.
4. I become very sad as this takes forever and shows my ignorance of Matlab.
If I understand your situation correctly here is a simple way to do it:
Assuming you have a matrix like so: M = [A B C D E] where each letter is a Nx1 vector.
You select a range. This part is not really clear to me, but suppose you can create the following:
idxA, idxB and idxC, logical vectors that are 1 (true) if the corresponding row is in the region and 0 (false) otherwise.
Then you can simply use:
M(idxA&idxB&idxC,:)
and you will get the additional two columns as well.
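A minimal sketch of that idea with hypothetical variable names: if the brushed subregion has been exported as a matrix sub containing the X, Y, Z columns, then ismember with the 'rows' option recovers the full five-column rows without any nested loop:
% M   : N-by-5 original matrix [X Y Z A B]
% sub : K-by-3 matrix created by the brush/select "Create variable" tool
idx      = ismember(M(:,1:3), sub, 'rows');   % rows of M whose X,Y,Z appear in the brushed set
full_sub = M(idx, :);                         % keeps columns A and B as well
Since the brushed values come verbatim from M, exact row matching is safe here despite the floating-point data.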

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicates the number of years; one row contains the production values corresponding to the 20 years. The other 99 rows of matrices 2, 3 and 4 are just different realizations (simulation runs), i.e. repeat cases that differ only because of the random numbers.
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
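For example, a sketch of running that test for each of the three methods against the single reference mean (this assumes Matrix_2, Matrix_3 and Matrix_4 correspond to methods a1, a2 and a3, and that the per-realization means are the 100-by-1 vectors described in the question):
ref_mean = mean(Matrix_1(:));            % scalar reference value from the base case
a1_means = mean(Matrix_2, 2);            % 100-by-1: mean over the 20 years, per realization
a2_means = mean(Matrix_3, 2);
a3_means = mean(Matrix_4, 2);
[h_a1, p_a1, ci_a1] = ttest(a1_means, ref_mean);   % one-sample t-test against the reference
[h_a2, p_a2, ci_a2] = ttest(a2_means, ref_mean);
[h_a3, p_a3, ci_a3] = ttest(a3_means, ref_mean);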
EDIT: I misinterpreted your question. See the answer of Yuk and the following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. Now the standard error on your mean is calculated as the standard deviation of your results divided by the square root of the number of observations, and the confidence interval is obtained by multiplying that standard error by approximately 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. The p-value can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. Given your p-values, these chances can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway; otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com.
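As a rough sketch of that calculation (assuming x is one of the 100-by-1 vectors of per-realization means):
n  = numel(x);
se = std(x) / sqrt(n);                            % standard error of the mean
ci = mean(x) + [-1 1] * tinv(0.975, n-1) * se;    % 95% CI, roughly mean +/- 2*se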
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work, provided the distributions are alike.
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
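A minimal sketch of that comparison (just an illustration; it assumes you want to compare the pooled values of each matrix against the reference, and it requires the Statistics Toolbox):
[h2, p2] = kstest2(Matrix_1(:), Matrix_2(:));   % 2-sample Kolmogorov-Smirnov test
[h3, p3] = kstest2(Matrix_1(:), Matrix_3(:));
[h4, p4] = kstest2(Matrix_1(:), Matrix_4(:));
qqplot(Matrix_1(:), Matrix_2(:));               % visual check of one comparison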
On a sidenote: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.

Problem using the find function in MATLAB

I have two arrays of data that I'm trying to amalgamate. One contains actual latencies from an experiment in the first column (e.g. 0.345, 0.455... never more than 3 decimal places), along with other data from that experiment. The other contains what is effectively a 'look up' list of latencies ranging from 0.001 to 0.500 in 0.001 increments, along with other pieces of data. Both data sets are X-by-Y doubles.
What I'm trying to do is something like...
for i = 1:length(actual_latency)
row = find(predicted_data(:,1) == actual_latency(i))
full_set(i,1:4) = [actual_latency(i) other_info(i) predicted_info(row,2) ...
predicted_info(row,3)];
end
...in order to find the relevant row in predicted_data where the look-up latency corresponds to the actual latency. I then use this to create an amalgamated data set, full_set.
I figured this would be really simple, but the find function keeps failing by throwing up an empty matrix when looking for an actual latency that I know is in predicted_data(:,1) (as I've double-checked during debugging).
Moreover, if I replace find with a for loop to do the same job, I get a similar error. It doesn't appear to be systematic - using different participant data sets throws it up in different places.
Furthermore, during debugging mode, if I use find to try and find a hard-coded value of actual_latency, it doesn't always work. Sometimes yes, sometimes no.
I'm really scratching my head over this, so if anyone has any ideas about what might be going on, I'd be really grateful.
You are likely running into a problem with floating point comparisons when you do the following:
predicted_data(:,1) == actual_latency(i)
Even though your numbers appear to only have three decimal places of precision, they may still differ by very small amounts that are not being displayed, thus giving you an empty matrix since FIND can't get an exact match.
One feature of floating point numbers is that certain numbers can't be exactly represented, since they can't be written as a finite sum of powers of 2. This occurs with the numbers 0.1 and 0.001. If you repeatedly add or multiply one of these numbers you can see some unexpected behavior. Amro pointed out one example in his comment: 0.3 is not exactly equal to 3*0.1. This can also be illustrated by creating your look-up list of latencies in two different ways. You can use the normal colon syntax:
vec1 = 0.001:0.001:0.5;
Or you can use LINSPACE:
vec2 = linspace(0.001,0.5,500);
You'd think these two vectors would be equal to one another, but think again!:
>> isequal(vec1,vec2)
ans =
0 %# FALSE!
This is because the two methods create the vectors by performing successive additions or multiplications of 0.001 in different ways, giving ever so slightly different values for some entries in the vector. You can take a look at this technical solution for more details.
When comparing floating point numbers, you should therefore do your comparisons using some tolerance. For example, this finds the indices of entries in the look-up list that are within 0.0001 of your actual latency:
tolerance = 0.0001;
for i = 1:length(actual_latency)
row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance);
...
The topic of floating point comparison is also covered in this related question.
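As an alternative sketch (my own suggestion, not from the answers here): since the look-up list lies on an exact 0.001 grid, you can round both sides to integer milliseconds and use ismember to get all matches in one vectorized call. This assumes actual_latency, other_info and the predicted_* variables are laid out as in the question:
% Convert latencies to integer milliseconds so the comparison is exact
actual_ms    = round(actual_latency * 1000);
predicted_ms = round(predicted_data(:,1) * 1000);
% For each actual latency, find the matching row in the look-up table
[found, row] = ismember(actual_ms, predicted_ms);
% Build the amalgamated set for the latencies that were found
full_set = [actual_latency(found), other_info(found), ...
            predicted_info(row(found), 2), predicted_info(row(found), 3)];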
You may try to do the following:
row = find(abs(predicted_data(:,1) - actual_latency(i)) < eps);
EPS is the floating-point relative accuracy (about 2.2e-16 for doubles); in practice you may need a somewhat larger tolerance, as in the answer above.
Have you tried using a tolerance rather than == ?