Collapse/mean data in Matlab with respect to a different set of data - matlab

I have two sets of data of different sizes.
Each set contains the measurements itself (MeasA and MeasB, both double) and the time point (TimeA and TimeB, datenum or julian date) when the measuring happened.
Now I want to match the two data sets. To do this, I want to average the data points of the bigger set around the time points of the smaller set, so that I can finally do some correlation analysis.
Edit:
A small example of what the data looks like:
MeasA = [2.7694 -1.3499 3.0349 0.7254 -0.0631];
TimeA = [0.2 0.4 0.7 0.8 1.3];
MeasB = [0.7147 -0.2050 -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302];
TimeB = [0.1 0.2 0.3 0.6 0.65 0.68 0.73 0.85 1.2 1.4];
And now I want to collapse MeasB and TimeB so that I get the mean of the measurement close to the timepoints in TimeA, so for example TimeB should look like this:
TimeB = [mean([0.1 0.2]) mean([0.3 0.6]) mean([0.65 0.68 0.73]) mean([0.85]) mean([1.2 1.4])]
TimeB = [0.15 0.4 0.69 0.85 1.3]
And then collapse MeasB like this too:
MeasB = [mean([0.7147 -0.2050]) mean([-0.1241 1.4897]) mean([1.4090 1.4172 0.6715]) mean([-1.2075]) mean([0.7172 1.6302])];
MeasB = [0.2549 0.6828 1.1659 -1.2075 1.1737]

The function interp1 is your friend.
You can get a new set of measurements for set B at the same time points as set A by using:
newMeasB = interp1( TimeB , MeasB , TimeA ) ;
The first two parameters are the original time and measurements of the set you want to re-interpolate; the last parameter is the new x axis (time in your example) at which the interpolated values are calculated.
This way you do not end up with different time vectors for your two sets of measurements, and you can compare them point by point.
Check the documentation of interp1 for more explanations and for options about the interpolation or any potential extrapolation.
edit:
The MATLAB documentation used to have a great illustration of the function, but I can't find it online, so here goes: [illustration not preserved]
So with the linear method, if the value is interpolated exactly between 2 points, the function will return the exact mean. If the interpolation is done closer to one point than another, the value returned will be proportionally closer to the value of the closest point.
NaN values can appear at the sides (beginning or end of the returned vector) if TimeA is not completely overlapped by TimeB. The function cannot "interpolate" there because there is no anchor point. However, the different options of interp1 allow you to "extrapolate" outside of the input range, or to assign another default value instead of the NaNs.
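As a minimal sketch of the behaviour described above, using the sample data from the question (the query points 0.25 and 1.5 are chosen here purely for illustration and are not part of the original data):

```matlab
% Sample data from the question
TimeA = [0.2 0.4 0.7 0.8 1.3];
MeasB = [0.7147 -0.2050 -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302];
TimeB = [0.1 0.2 0.3 0.6 0.65 0.68 0.73 0.85 1.2 1.4];

% Resample set B at the time points of set A (linear interpolation is the default)
newMeasB = interp1(TimeB, MeasB, TimeA);

% Halfway between two samples, linear interpolation returns their mean
midValue = interp1(TimeB, MeasB, 0.25);   % between TimeB = 0.2 and TimeB = 0.3

% Outside the range of TimeB the default result is NaN ...
outsideDefault = interp1(TimeB, MeasB, 1.5);
% ... unless extrapolation or a fill value is requested explicitly
outsideExtrap = interp1(TimeB, MeasB, 1.5, 'linear', 'extrap');
outsideFilled = interp1(TimeB, MeasB, 1.5, 'linear', 0);
```

Note that this interpolates between the two neighbouring samples only; it does not average all samples of set B that fall near a time point of set A.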


Huffman Coding for Markov Chain based on conditional distribution

Before I start describing my problem, I would like to note that this question is for a project for one of my courses at University, so I do not seek for the solution, rather for a hint or an explanation.
So, let's assume that there are 3 states {1,2,3} and that I also have the transition probability matrix (3x3). I wrote a MATLAB script that, based on the transition matrix, creates a vector with N samples of the Markov chain. Assume that the first state is state 1. Now, I need to Huffman code this chain based on the conditional distribution p(Xn | Xn−1).
If I am not mistaken, I think that I have to create 3 Huffman dictionaries and encode each symbol from the chain above, based on the previous state(?), which means that each symbol is going to be encoded with one out of the three dictionaries I created, but not all of them with the same dictionary.
If the encoding process is correct, how do I decode the coded vector?
I am not really sure if that's how it should be done.
Any ideas would be appreciated.
Thanks in advance!
That's right. There would be a Huffman code for the three symbols p11, p12, and p13, another for p21, p22, p23, etc.
Decoding chooses which code to use based on the current state. There needs to either be an assumption for the starting state, or the starting state needs to be transmitted.
However, this case is a little odd, since there is only one possible shape of Huffman code for three symbols: code lengths of 1 bit, 2 bits, and 2 bits, e.g. 0, 10, 11. So the only gain you get is by picking the highest probability for the one-bit symbol.
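To make that last point concrete, here is a small sketch that computes the expected number of bits per symbol for each state when the most probable transition gets the 1-bit code. The transition matrix T is a made-up example, not the one from the question.

```matlab
% Hypothetical 3-state transition matrix (each row sums to 1)
T = [0.5 0.25 0.25;
     0.1 0.8  0.1;
     0.3 0.3  0.4];

codeLengths = [1; 2; 2];   % the only code-length profile for three symbols

avgLen = zeros(3,1);
for s = 1:3
    p = sort(T(s,:), 'descend');   % most probable transition gets the 1-bit code
    avgLen(s) = p * codeLengths;   % expected bits per symbol when leaving state s
end
```

With these assumed probabilities, the state whose transitions are most skewed (state 2) has the shortest expected code length, and no row can go below 1 bit per symbol.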
Well, having solved the problem above, I decided to post the answer with the octave script in case anyone needs it in future.
So, let's assume that there are 5 states {1,2,3,4,5} and that I also have the transition probability matrix (5x5). I Huffman encoded and decoded the Markov chain in 1000 Monte Carlo experiments.
The Octave Script is:
%starting State of the chain
starting_value = 1;
%Chain Length
chain_length = 100;
%# of Monte Carlo experiments
MC=1000;
%Variable to count all correct coding/encoding experiments
count=0;
%Create unique symbols, and assign probabilities of occurrence to them.
symbols = 1:5;
p1 = [.5 .125 .125 .125 0.125];
p2 = [.25 .125 .0625 .0625 0.5];
p3 = [.25 .125 .125 .25 0.25];
p4 = [.125 0 .5 .25 0.125];
p5 = [0 .5 .25 .25 0];
%Create a Huffman dictionary based on the symbols and their probabilities.
dict1 = huffmandict(symbols,p1);
dict2 = huffmandict(symbols,p2);
dict3 = huffmandict(symbols,p3);
dict4 = huffmandict(symbols,p4);
dict5 = huffmandict(symbols,p5);
% Create the transition matrix for each state
T= [0.5 0.125 0.125 0.125 0.125;
0.25 0.125 0.0625 0.0625 0.5;
0.25 0.125 0.125 0.25 0.25;
0.125 0 0.5 0.25 0.125 ;
0 0.5 0.25 0.25 0];
%Initialize Markov Chain
chain = zeros(1,chain_length);
chain(1)=starting_value;
for mc=1:MC %separate loop variable, so the inner loops over i do not reuse it
comp=[];
dsig=[];
%Create Markov Chain
for i=2:chain_length
this_step_distribution = T(chain(i-1),:);
cumulative_distribution = cumsum(this_step_distribution);
r = rand();
chain(i) = find(cumulative_distribution>r,1);
end
comp=huffmanenco(chain(1),dict1);
%Encode the random symbols.
for i=2:chain_length
if chain(i-1)==1
comp = horzcat(comp,huffmanenco(chain(i),dict1));
elseif chain(i-1)==2
comp = horzcat(comp,huffmanenco(chain(i),dict2));
elseif chain(i-1)==3
comp = horzcat(comp,huffmanenco(chain(i),dict3));
elseif chain(i-1)==4
comp = horzcat(comp,huffmanenco(chain(i),dict4));
elseif chain(i-1)==5
comp = horzcat(comp,huffmanenco(chain(i),dict5));
end
end
%Decode the data. Verify that the decoded data matches the original data.
dsig(1)=starting_value;
comp=comp(length(dict1{1,1})+1:end);
for i=2:chain_length
if dsig(end)==1
temp=huffmandeco(comp,dict1);
comp=comp(length(dict1(temp(1)){1,1})+1:end);
elseif dsig(end)==2
temp=huffmandeco(comp,dict2);
comp=comp(length(dict2(temp(1)){1,1})+1:end);
elseif dsig(end)==3
temp=huffmandeco(comp,dict3);
comp=comp(length(dict3(temp(1)){1,1})+1:end);
elseif dsig(end)==4
temp=huffmandeco(comp,dict4);
comp=comp(length(dict4(temp(1)){1,1})+1:end);
elseif dsig(end)==5
temp=huffmandeco(comp,dict5);
comp=comp(length(dict5(temp(1)){1,1})+1:end);
end
dsig=horzcat(dsig,temp(1));
end
count=count+isequal(chain,dsig);
end
count
The variable count is there to make sure that, in all of the MC experiments, the generated Markov chain was properly encoded and decoded. (If count equals 1000, all the experiments gave correct results.)
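The script above calls huffmandeco on the whole remaining bit stream at every step. As a toolbox-free sketch of the same idea, the decoder below reads one codeword at a time and picks the code table from the previously decoded state. The three code tables and the short chain are hypothetical and hand-written for illustration; they are not what huffmandict would return for the matrix above.

```matlab
% codes{s}{k} is the (hypothetical) bit pattern for a transition from state s to state k.
% Each table is a complete prefix code with lengths 1, 2, 2.
codes = { {[0],   [1 0], [1 1]}, ...
          {[1 0], [0],   [1 1]}, ...
          {[1 0], [1 1], [0]} };

chain = [1 1 2 2 3 1 3 3];   % example chain; the starting state is known to the decoder

% Encode: each symbol is coded with the table of the previous state
bits = [];
for n = 2:numel(chain)
    bits = [bits codes{chain(n-1)}{chain(n)}];
end

% Decode: match one codeword at a time using the table of the last decoded state
decoded = chain(1);
pos = 1;
while pos <= numel(bits)
    table = codes{decoded(end)};
    for k = 1:numel(table)
        c = table{k};
        stop = pos + numel(c) - 1;
        if stop <= numel(bits) && isequal(bits(pos:stop), c)
            decoded(end+1) = k;   % symbol k was sent
            pos = stop + 1;       % continue after this codeword
            break
        end
    end
end
```

Because every table is a prefix code, exactly one codeword matches at each position, so the inner loop always advances pos for a correctly encoded stream.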

MATLAB scrolling plot

I have an EEG data base that I would like to plot.
The database is a 19*1000*134 matrix, with:
19 being the number of channel. On a first approach, I'm working with only one channel.
1000 the size of a sample (1000 points for a sampling rate of 500 Hz, i.e. 2 sec of data)
134 the number of epochs (number of different 2 second experience)
The idea is to plot epoch n right after epoch n-1 on the same graph. The (X,Y) matrix used to plot this has a 134 000 * not_much size, and I would like to be able to scroll horizontally on the plot, to see individually each epoch.
My code right now, plotting only one channel:
fs = s_EEG.sampling_rate;
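[channel, nb_samples, nb_epoch] = size(s_EEG.data); % avoid shadowing the built-in length
display_epochs(s_EEG.data, fs, nb_samples, channel, nb_epoch)
function display_epochs(data, fs, nb_samples, channel, nb_epoch) % avoid shadowing the built-in display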
figure("Name", "Epoch display")
for j = 1:nb_epoch
time = 0.002+(2*j-2):1/fs:2*j;
epoch = data(1,:,j);
plot(time, epoch)
hold on
end
hold off
end
Current output:
I'm completely new to Matlab, and I don't use it well yet, but I would like to find a way to see on the same graph, individually, and at a correct visualization scale, all of my 134 epochs (one color = one epoch above).
Thanks !
This is very similar to something I already had so I tweaked it a bit for you. Basically pass plotData your data matrix. It will plot each of your items sequentially as you already have now.
Clicking the slider arrows changes the x-limits so that you step through one epoch at a time. Clicking in the slider trough advances two epochs at a time. It currently just displays the number of the epoch you are viewing at the command line with disp(['Current Epoch: ' num2str(viewI)]). However, it should be easy to redirect that to a text box on the figure so that you can more readily see which epoch you are viewing, besides mentally dividing the x-limits by 2.
Use the list box to switch to a new channel which will reset the plot & x-limits.
Call it like this at the command line.
>> plotData( data )
CODE: Save everything below as plotData.m
function plotData( data )
% data = rand(19,1000,134);
f = figure('Units','Normalized','Position',[0.25 0.25 0.5 0.5]);
a = axes('Units','Normalized','Position',[0.05 0.15, 0.75 0.75]);
s = uicontrol(f, 'Style','Slider', 'Units','Normalized','Position',[0.05 0.025, 0.75 0.05],...
'Min',1,'Max',size(data,3),'Value',1, 'Callback',{@sliderChange,a} );
l = uicontrol(f, 'Style','listbox','Units','Normalized','Position',[0.85 0.15, 0.1, 0.75],...
'String',cellstr(num2str([1:size(data,1)]')),'Callback',{@changeChannel,a,s,data} );
stepSize = 1/(s.Max - s.Min);
s.SliderStep = [stepSize 2*stepSize];
changeChannel(l,[],a,s,data)
function changeChannel(l,evtData,a,s,data)
cla(a);
chanNum = str2double(l.String{l.Value});
sR = 500; %500Hz
tempData = reshape(data(chanNum,:,:),[],size(data,3)); %Reshape each epoch into a column
tempTime = [0:1/sR:(size(data,2)-1)/sR]' + (0:1:size(data,3)-1)*2; %Build time array
plot(a,tempTime,tempData) %plot all the lines
s.Value = 1; %Reset Slider Position
function sliderChange(s,evtData,a)
viewI = round(s.Value);
disp(['Current Epoch: ' num2str(viewI)])
xlim(a,[(viewI-1)*2 viewI*2] + [-.1 .1])
Note: This uses implicit expansion, so you need MATLAB R2016b or later.
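For releases without implicit expansion, a sketch of the same time-array construction with bsxfun could look like this. The sizes are taken from the question (1000 samples per epoch, 134 epochs, 500 Hz); the variable names t and offsets are introduced here for illustration.

```matlab
sR = 500;          % sampling rate in Hz, as in the question
nSamples = 1000;   % samples per epoch
nEpochs = 134;     % number of 2-second epochs

t = (0:nSamples-1)' / sR;       % time within one epoch (column vector)
offsets = (0:nEpochs-1) * 2;    % start time of each epoch (row vector)

% Column plus row gives an nSamples-by-nEpochs matrix, one column per epoch
tempTime = bsxfun(@plus, t, offsets);
```

This should produce the same matrix as the line that builds tempTime in changeChannel.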

How to find and label the most frequent values with a tiny variance in MATLAB?

I have a vector d:
d = [
1.19011941712580e-06
6.39136179286748e-06
1.26442316296575e-05
1.81039120389278e-05
1.91304903300688e-05
2.19912290910362e-05
2.94113112667430e-05
3.42238417249065e-05
4.14201181268186e-05
5.76014376298924e-05
6.81337071520188e-05
0.000108396864465101
0.000130922201344182
0.000145712942644687
0.000174386494384153
0.000262758083529471
0.000303050975943883
0.000373066486719321
0.000423949134658855
0.000489079623696380
0.000548432526451254
0.000694787830192734
0.000881370593483890
0.00125516689720339
0.00145237435686831
0.00815957230852142
0.0210146005799470
0.0507995676939279
0.0541594307796186
1
]
Plotting d:
plot(d, 'x:')
In this situation, [M, F] = mode(d) gives a result that I didn't want.
Is there any function that counts the most frequent values which takes a sort of tolerance into account?
Clustering could be considered. However, in the plot above, clustering may assign d(27:29) to the left-hand cluster.
My current approach is normalizing and thresholding:
d_norm = d / max(d);
v = d_norm(d_norm < 0.01); % 1 percent threshold
However, I think this is rather hard-coded and not a good approach.
Histcounts is your friend!
Your "threshold" can be easily translated to histogram bins if you know your range. In your case the range is 0 to 1, so a threshold of 0.01 corresponds to 100 bins.
counts=histcounts(d,100)
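As a rough sketch of how the counts could then be used, the snippet below passes explicit bin edges (so the bin width is exactly the chosen tolerance) and picks the fullest bin. The vector d here is a shortened, rounded stand-in for the data in the question, and the names edges, tol and frequent are introduced for illustration.

```matlab
% Shortened, rounded stand-in for the vector d from the question
d = [1.19e-06; 1.26e-05; 0.000108; 0.00125; 0.00816; 0.0210; 0.0508; 0.0542; 1];

tol = 0.01;                           % tolerance, i.e. the bin width
edges = linspace(0, 1, 1/tol + 1);    % 101 edges give 100 bins on [0, 1]

counts = histcounts(d, edges);        % number of values per bin
[maxCount, k] = max(counts);          % fullest bin

% Values that fall into the fullest bin (this is valid because it is not the last bin here)
frequent = d(d >= edges(k) & d < edges(k+1));
```

The result still depends on the chosen tolerance and on where the bin edges fall, so values close to an edge can be split between neighbouring bins.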

Matlab standard deviation - treat vector as probabilities, not values

I have a vector in Matlab that looks something like:
vect = 0
100
300
500
700
1000
500
300
200
0
When normalised, each value should indicate the probability of a certain value, and my values are just 1 to 10 (i.e. 0% chance of 1, 100/sum(vect) chance of 2, etc).
How do I work out statistics on the value (in particular standard deviation)..? If I do mean(vect), I just end up with 360, and I get a similarly large value for standard deviation. The mean value should, of course, be around 5. I'm sure it wouldn't be too hard to code up manually at all, but there must be a way of doing this directly in Matlab, so I figured I'd ask!
I am not really sure whether MATLAB has a built-in function for this, but it is no big deal. Both are one-liners anyway:
vect = [0; 100; 300; 500; 700; 1000; 500; 300; 200; 0];
prob = vect./sum(vect);
val = [1:10].';
meanVal = sum(prob.*val);
stDev = sqrt( sum( prob.*val.^2 ) -sum(prob.*val)^2 );
EDIT:
There are two functions that do this. They are also called mean and std, but they take a probability distribution object instead.
If you call stem(vect) you'll see that vect looks like the probability density function of a normally distributed variable, so you can fit a normal distribution to vect without normalization:
x = (1:length(vect))';
pdf = fitdist(x, 'normal', 'freq', vect);
The result has an average value of 5.63889 and a standard deviation of 1.66944.
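The fitted standard deviation differs slightly from the one-liner further up. A plausible explanation, sketched below without any toolbox, is that the one-liner computes the population standard deviation (dividing by n), while the fitted value matches the sample standard deviation (dividing by n-1) when vect is treated as frequency counts. The names n, mu, popStd and sampleStd are introduced here for illustration.

```matlab
vect = [0; 100; 300; 500; 700; 1000; 500; 300; 200; 0];
val = (1:10).';

n = sum(vect);                 % total number of observations if vect holds counts
mu = sum(vect .* val) / n;     % weighted mean

% Population standard deviation (same as the one-liner with probabilities)
popStd = sqrt(sum(vect .* (val - mu).^2) / n);

% Sample standard deviation (divides by n-1 instead of n)
sampleStd = sqrt(sum(vect .* (val - mu).^2) / (n - 1));
```

Here mu is about 5.63889, popStd about 1.66921 and sampleStd about 1.66944, so with this many observations the two estimates differ only in the fourth decimal place.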

Unexpected behaviour of function findpeaks in MATLAB's Signal Processing Toolbox

Edit: Actually this is not unexpected behaviour, but I still need a solution.
findpeaks compares each element of data to its neighboring values.
I have data containing peaks, which I detect with the function findpeaks from the Signal Processing Toolbox. Sometimes the function does not seem to detect the peaks properly when the same value occurs twice next to each other. This occurs very rarely in my data, but here is a sample to illustrate my problem:
>> values
values =
-0.0324
-0.0371
-0.0393
-0.0387
-0.0331
-0.0280
-0.0216
-0.0134
-0.0011
0.0098
0.0217
0.0352
0.0467
0.0548
0.0639
0.0740
0.0813
0.0858 <-- here should be another peak
0.0858 <--
0.0812
0.0719
0.0600
0.0473
0.0353
0.0239
0.0151
0.0083
0.0034
-0.0001
-0.0025
-0.0043
-0.0057
-0.0048
-0.0038
-0.0026
0.0007
0.0043
0.0062
0.0083
0.0106
0.0111
0.0116
0.0102
0.0089
0.0057
0.0025
-0.0025
-0.0056
Now the findpeaks function only finds one peak:
>> [pks loc] = findpeaks(values)
pks =
0.0116
loc =
42
If I plot the data, it becomes obvious that findpeaks misses one of the peaks at the location 18/19 because they both have the value 0.08579.
What is the best way to find those missing peaks?
If you have the image processing toolbox, you can use IMREGIONALMAX to find the peaks, after which you can use regionprops to find the center of the regions (if that's what you need), i.e.
bw = imregionalmax(signal);
peakLocations = find(bw); %# returns n peaks for an n-tuple of max-values
stats = regionprops(bw,'Centroid');
peakLocations = cat(1,stats.Centroid); %# returns the center of the n-tuple of max-values
This is an old topic, but maybe some are still looking for an easier solution to this (like I did today):
You could also just subtract some very small fixed value from all values on a plateau except the first. This makes the first value of each plateau the highest on that plateau, so it is included as a peak.
Just make something like this part of your code:
peaks = yourdata;
verysmallvalue = .001;
plateauvalue = peaks(1);
for i = 2:size(peaks,1)
if peaks(i) == plateauvalue
peaks(i) = peaks(i) - verysmallvalue;
else
plateauvalue = peaks(i);
end
end
[PKS,LOCS] = findpeaks(peaks);
plot(yourdata);
hold on;
plot(LOCS, yourdata(LOCS), 'Color', 'Red', 'LineStyle', 'None', 'Marker', 'o');
Hope this helps!
Use the second derivative test instead?
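One way to sketch a derivative-based approach without any toolbox is to look at sign changes of the first difference while skipping flat segments, so that a plateau counts as a single peak. This is an illustration of the idea, not the behaviour of findpeaks; the test vector x and all variable names are made up for the example.

```matlab
x = [0 1 3 3 2 0 1 0];   % hypothetical signal: a plateau peak at 3:4 and a sharp peak at 7

d = diff(x);
nz = find(d ~= 0);       % positions where the signal actually changes
s = sign(d(nz));         % +1 for a rise, -1 for a fall, flat steps skipped

% A peak is a rise that is followed (possibly after a flat part) by a fall
k = find(s(1:end-1) > 0 & s(2:end) < 0);

risingEdge  = nz(k) + 1;                     % first sample of each peak or plateau
fallingEdge = nz(k+1);                       % last sample of each peak or plateau
center = (risingEdge + fallingEdge) / 2;     % middle of the plateau (may be fractional)
peakValues = x(risingEdge);
```

For this x, risingEdge is [3 7], fallingEdge is [4 7] and center is [3.5 7], so the plateau is reported once and the sharp peak is still found.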
I ended up writing my own simpler version of findpeaks, which seems to work for my purpose.
function [pks,locs] = my_findpeaks(X)
M = numel(X);
pks = [];
locs = [];
if (M < 4)
datamsgid = generatemsgid('emptyDataSet');
error(datamsgid,'Data set must contain at least 4 samples.');
else
for idx=1:M-3
if X(idx)< X(idx+1) && X(idx+1)>=X(idx+2) && X(idx+2)> X(idx+3)
pks = [pks X(idx+1)]; % the peak is the sample after the rising edge
locs = [locs idx+1];
end
end
end
end
Edit: To clarify, the problem arose when I had a peak exactly between two sample points and those two sample points coincidentally had the same value. It only happened a couple of times in more than 10,000 cases.
The behavior that you describe is a known bug in versions of MATLAB prior to R2010b. The minimum example is
findpeaks([0 1 1 0])
which returns [], while
findpeaks([0 1 0])
returns the (position of the) peak.
The bug has been fixed in R2010b and later, see the official Bug Report. With that fix, findpeaks returns the rising edge of "peaks with repeated values" (which I would call plateaus).