Figuring out where to add text in a MATLAB histogram

I have a MATLAB histogram produced from some data, with about 50 bins. I now need to insert a line of text into the graph, at a place where it won't overlap the histogram bars. The text is basically 'Period of data used: mmm dd to mmm dd' (I mention this to give an idea of the width required and where the text can be split if necessary).
One method I considered was finding a series of contiguous histogram bins where the frequency (y axis) stays below 90% of the maximum of all frequencies; the text could then be printed near the top of the graph, starting at the x position of the first of those bins.
Is this a good way of going about it? If so, how do I compute this contiguous series of bins without looping around?
Or is there a better way of placing this text adaptively according to the data?
Edit: Due to other considerations, the number of histogram bins is no longer fixed at 50, but is xmax/20, where xmax is the maximum x-axis value. Algorithms that work on aggregates of a number of bins may need to take this variability into account when calculating that number.

I think the simplest way would be to use a multiline title, optionally along with TeX formatting to de-emphasise the additional info. To make a multiline title, pass a cell array of strings like this:
title({'\fontsize{16}Actual Title';'\fontsize{8}other info'})
Since it is consistent across the histograms, I think this would look tidier than text placed on the graph itself, which might move around.
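That said, if you do want the adaptive placement described in the question, the contiguous run of quiet bins can be found without an explicit loop. Here is a minimal sketch, assuming the raw data is in a vector x, the histogram is already drawn on the current axes, and the 50-bin count and 90% threshold follow the question (all names illustrative):
[counts, edges] = histcounts(x, 50);
low = counts < 0.9*max(counts);           % bins with headroom above them for text
d = diff([0 low 0]);                      % mark where runs of low bins start and stop
runStart = find(d == 1);
runEnd   = find(d == -1) - 1;
[runLen, k] = max(runEnd - runStart + 1); % pick the longest run of low bins
if runLen > 0
    text(edges(runStart(k)), 0.95*max(counts), ...
        'Period of data used: mmm dd to mmm dd');
end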

Related

Matlab - Error bars for (large) noisy data sets

I have ten large linear arrays (n elements) such as
A = [ A1 A2....An ]
B = [ B1 B2....Bn ]
....
J = [ J1 J2....Jn ]
I can take the arithmetic mean of these arrays by adding them and dividing by ten; this reduces the noise substantially and shows the trend I am looking for. (Note that I often have more or fewer than ten data sets, but this is representative. Also, n varies, but is generally tens of thousands of data points.)
What I would like to do is plot this average with error bars that represent the noise in the original ten arrays. The arrays are large, so maybe error bars at sensible increments (say ten error bars across the entire range where the deviation from the average is greatest).
The image shows 10 noisy data sets plotted as grey lines with the mean as a black line.
thanks
I have arrived at a rather laborious solution to this problem by writing what seems like a lengthy piece of code.
The code takes all the input arrays and creates two new arrays: one containing the maximum y value for each x and one containing the minimum. This is done with the max and min functions in MATLAB.
The minimum is subtracted from the maximum to create an array of the magnitudes of the "error" at each value of x.
Then every nth value of the error array is plotted as an error bar on top of the arithmetic mean value of all the original input arrays.
It's a fix to the problem - and the screenshot shows the result - but I was wondering if there is a more elegant "built in" solution that does this in one shot.
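For reference, here is a minimal sketch of the approach described above, assuming the ten arrays have been stacked into the rows of a matrix data (all names illustrative); the built-in errorbar function draws the bars in a single call:
data   = [A; B; C; D; E; F; G; H; I; J];        % one noisy data set per row
x      = 1:size(data, 2);
avg    = mean(data, 1);                         % arithmetic mean of all data sets
spread = max(data, [], 1) - min(data, [], 1);   % max-min "error" at each x

idx = 1:round(numel(x)/10):numel(x);            % roughly ten error bar positions

plot(x, avg, 'k'); hold on
errorbar(x(idx), avg(idx), spread(idx)/2, 'k.') % each bar spans the max-min range
hold off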

Datetick take into account NaN in plot

I have a series y that contains values, some of which are NaN and some numeric (double).
The series has an associated vector d which contains the datenum dates.
Example:
y=[NaN(5,1); rand(10,1)]
d=now-14:now
When I run:
plot(d,y)
I get the graph I want; the NaN observations are taken out.
However, when I run:
plot(d,y); datetick
then my graph starts from the beginning of the series and the axis takes into account all the observations (even those where y is NaN).
How can I prevent this from happening?
From the documentation we can see that there is an easy way (shown below) to preserve the current axes limits.
plot(d,y);
datetick('keeplimits');
The 'keeplimits' argument does exactly what it suggests, maintaining the x-axis limits whilst converting the tick values to dates. You may also want to pass 'keepticks' to preserve tick mark locations.
The behaviour you describe seems contrary to the docs:
datetick selects a label format based on the minimum and maximum limits of the specified axis.
From this statement I would expect the axis limits to remain the same, but there is obviously something about the way the limits are handled internally that means the NaN points are included. At least we are given a simple workaround!
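For example, with the question's own data (a minimal sketch; 'keepticks' is optional):
y = [NaN(5,1); rand(10,1)];
d = now-14:now;
plot(d, y)
datetick('x', 'keeplimits', 'keepticks')  % convert ticks to dates without changing the limits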

How do I plot values in an array vs the number of times those values appear in Matlab?

I have a set of ages (over 10000 of them) and I want to plot a graph with the age from 20 to 100 on the x axis and the number of times each of those ages appears in the data on the y axis. I have tried several ways to do this and I can't figure it out. I also have some other data which requires me to plot values vs how many times they occur, so any advice on how to do this would be much appreciated.
I'm quite new to Matlab so it would be great if you could explain how things in your answer work rather than just typing out some code.
Thanks.
EDIT:
So I typed histogram(Age, 80) because, as I understand it, that will plot the values in Age on a histogram split into 80 bars (one for each age). Instead I get this:
The bars aren't aligned, it's clearly not one bar per age, and it hasn't plotted the number of times each age occurs on the y axis.
You are right to use histogram().
Let's see with an example.
I extract 100 ages between 20 and 100:
ages=randsample([20:100],100,true);
Now I call histogram() in this manner:
h=histogram(ages,[20:100]);
where h is a histogram object; this will also show the following plot:
However, this might look easy because my ages vector lies in the range 20:100, so it contains no other values. If your vector instead also contains ages outside the range 20:100, you can specify the additional 'BinLimits' option in histogram() like this:
h=histogram(ages,80,'BinLimits',[20 100]);
and this option plots a histogram using the values in ages that fall between 20 and 100 inclusive.
Note: by inspecting h you can actually see and/or edit some properties of your histogram. A property of the object you might be interested in is Values. This is a vector of length 80 (in our case, since we work with 80 bins) in which the i-th element is the number of items in the i-th bin. This will help you count the occurrences (just in case you need them to go on with your analysis).
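For example, continuing from the h above, the per-age counts can be read straight off the object (a small sketch; ageCounts is just an illustrative name):
ageCounts = h.Values;            % one count per bin, in bin order
plot(20:99, ageCounts, 'o-')     % occurrences per age (the last bin also includes 100)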
Like Luis said in comments, hist is the way to go. You should specify the bin centers (one per age), rather than the number of bins:
ages = randi([20 100], [1 10000]);
hist(ages, [20:100])
Is this what you were looking for?

MATLAB - histograms of equal size and histogram overlap

An issue I've come across multiple times is wanting to take two similar data sets and create histograms from them where the bins are identical, so as to easily calculate things like histogram overlap.
You can define the number of bins (obviously) using
[counts, bins] = hist(data,number_of_bins)
But there's not an obvious way (as far as I can see) to make the bins identical across several different data sets. If I remember correctly, when I initially looked I found various people who seemed to have the same issue, but no good solutions.
The right, easy way
As pointed out by horchler, this can easily be achieved using either histc (which lets you define your bins vector), or vectorizing your histogram input into hist.
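A minimal sketch of both options, assuming two row-vector data sets A and B like the ones generated in the example below:
edges = linspace(min([A B]), max([A B]), 501);  % identical bin edges for both sets
counts_A = histc(A, edges);   % counts_A(end) counts values exactly equal to edges(end)
counts_B = histc(B, edges);

% or, vectorizing the input to hist (one column of counts per data set):
centers = linspace(min([A B]), max([A B]), 500);
counts  = hist([A; B]', centers);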
The wrong, stupid way
I'm leaving the below as a reminder to others that even stupid questions can yield worthwhile answers.
I've been using the following approach for a while, so figured it might be useful for others (or, someone can very quickly point out the correct way to do this!).
The general approach relies on the fact that MATLAB's hist function defines an equally spaced number of bins between the largest and smallest value in your sample. So, if you append a start (smallest) and end (largest) value to your various samples, which are the min and max over all samples of interest, this forces the histogram range to be equal for all your data sets. You can then truncate the first and last values to recreate your original data.
For example, create the following data set
A = randn(1,2000)+7
B = randn(1,2000)+9
[counts_A, bins_A] = hist(A, 500);
[counts_B, bins_B] = hist(B, 500);
Here for my specific data sets I get the following results
bins_A(1) % 3.8127 (this is also min(A) )
bins_A(500) % 10.3081 (this is also max(A) )
bins_B(1) % 5.6310 (this is also min(B) )
bins_B(500) % 13.0254 (this is also max(B) )
To create equal bins you can simply first define a min and max value slightly outside the range of both data sets:
topval = max([max(A) max(B)])+0.05;
bottomval = min([min(A) min(B)])-0.05;
The addition/subtraction of 0.05 is based on knowledge of the range of values - you don't want your extra bin to be too far from or too close to the actual range. That said, because this example uses the joint min/max values, the code will work irrespective of the A and B values generated.
Now we re-create histogram counts and bins using (note the extra 2 bins are for our new largest and smallest value)
[counts_Ae, bins_Ae] = hist([bottomval, A, topval], 502);
[counts_Be, bins_Be] = hist([bottomval, B, topval], 502);
Finally, you truncate the first and last bin and value entries to recreate your original sample exactly
bins_A = bins_Ae(2:501)
bins_B = bins_Be(2:501)
counts_A = counts_Ae(2:501)
counts_B = counts_Be(2:501)
Now
bins_A(1) % 3.7655
bins_A(500) % 13.0735
bins_B(1) % 3.7655
bins_B(500) % 13.0735
From this you can easily plot both histograms again
bar([bins_A;bins_B]', [counts_A;counts_B]')
And also plot the histogram overlap with ease
bar(bins_A,(counts_A+counts_B)-(abs(counts_A-counts_B)))

Arbitrary distribution -> Uniform distribution (Probability Integral Transform?)

I have 500,000 values for a variable derived from financial markets. Specifically, this variable represents distance from the mean (in standard deviations). The variable has an arbitrary distribution. I need a formula that will allow me to select a range around any value of this variable such that an equal (or close to equal) number of data points fall within that range.
This will allow me to then analyze all of the data points within a specific range and to treat them as "similar situations to the input."
From what I understand, this means that I need to convert it from arbitrary distribution to uniform distribution. I have read (but barely understood) that what I am looking for is called "probability integral transform."
Can anyone assist me with some code (Matlab preferred, but it doesn't really matter) to help me accomplish this?
Here's something I put together quickly. It's not polished and not perfect, but it does what you want to do.
clear
randList=[randn(1e4,1);2*randn(1e4,1)+5];
[xCdf,xList]=ksdensity(randList,'npoints',5e3,'function','cdf');
xRange=getInterval(5,xList,xCdf,0.1);
and the function getInterval is
function out=getInterval(yPoint,xList,xCdf,areaFraction)
yCdf=interp1(xList,xCdf,yPoint);
yCdfRange=[-areaFraction/2, areaFraction/2]+yCdf;
out=interp1(xCdf,xList,yCdfRange);
Explanation:
The CDF of the random distribution is shown below by the line in blue. You provide a point (here 5 in the input to getInterval) about which you want a range that gives you 10% of the area (input 0.1 to getInterval). The chosen point is marked by the red cross and the interval is marked by the lines in green. You can get the corresponding points from the original list that lie within this interval as
newList=randList(randList>=xRange(1) & randList<=xRange(2));
You'll find that, on average, the number of points in this example is ~2000, which is 10% of numel(randList):
numel(newList)
ans =
2045
NOTE:
Please note that this was done quickly and I haven't made any checks to see if the chosen point is outside the range or if yCdfRange falls outside [0 1], in which case interp1 will return a NaN. This is fairly straightforward to implement, and I'll leave that to you.
Also, ksdensity is very CPU intensive. I wouldn't recommend increasing npoints to more than 1e4. I assume you're only working with a fixed list (i.e., you have a list of 5e5 points that you've obtained somehow and now you're just running tests/analyzing it). In that case, you can run ksdensity once and save the result.
I do not speak Matlab, but you need to find quantiles in your data. This is Mathematica code which would do this:
In[88]:= data = RandomVariate[SkewNormalDistribution[0, 1, 2], 10^4];
Compute quantile points:
In[91]:= q10 = Quantile[data, Range[0, 10]/10];
Now form pairs of consecutive quantiles:
In[92]:= intervals = Partition[q10, 2, 1];
In[93]:= intervals
Out[93]= {{-1.397, -0.136989}, {-0.136989, 0.123689}, {0.123689, 0.312232},
          {0.312232, 0.478551}, {0.478551, 0.652482}, {0.652482, 0.829642},
          {0.829642, 1.02801}, {1.02801, 1.27609}, {1.27609, 1.6237},
          {1.6237, 4.04219}}
Verify that the splitting points separate data nearly evenly:
In[94]:= Table[Count[data, x_ /; i[[1]] <= x < i[[2]]], {i, intervals}]
Out[94]= {999, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000}
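For reference, a rough MATLAB translation of the same idea (a sketch; quantile requires the Statistics Toolbox, and data here is just a stand-in for the 500,000 market values):
data = randn(1e4, 1);                      % stand-in for the real values
q10  = quantile(data, 0:0.1:1);            % 11 quantile points -> 10 intervals
intervals = [q10(1:end-1); q10(2:end)]';   % one [lower upper] interval per row

% verify that the splitting points separate the data nearly evenly
counts = histcounts(data, q10);            % one count per interval (last bin includes q10(end))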