Understanding and coding the zero-lag cross-correlation in MATLAB

First of all, I am sorry if I am a dummy and can't understand this part of an article. I have a set of data with 200 channels in which specific pairs of channels are co-dependent. In the paper it is mentioned:
"
For each channel, we filtered both signals
between 0.5 and 2.5 Hz to preserve only the cardiac component and
normalized the resulting signals to balance any difference between
their amplitude.
"
Question 1: does this mean I need to normalize both co-dependent channels to the average of their medians, or just normalize each signal to its own median?
Here is the rest of the paragraph
"
Then, we computed the cross-correlation and
extracted the value at a time lag of 0 to quantify the similarity
between the filtered signals. In-phase and counter-phase identical
waveforms yielded a zero-lag cross-correlation value of 1 and -1
respectively, whereas a null value derived from totally uncorrelated
signals. "
I wrote the code below, but I get -1 or +1 everywhere: even for signals that are not codependent, it gives me 1 or -1. I guess part of the code is wrong, but I honestly don't know where. Here is the code:
datafile='data_sess_03.nirs'
ch_num=1
[w,src,det,mlOrg,mlo,mlm,Data,datap,acc1,acc2]=readData(datafile);
fc=[0.5 2.5];
dataf=filterData(Data,fc);
[acor,lags]=xcorr(dataf(1,:),dataf(5,:),0); % channels 1 and 5 are codependent
% acor is -1 or +1 everywhere, even in the noisy channels
plot(acor,'black')
[~,I] = max(abs(acor));
lagDiff = lags(I)/fs  % fs is the sampling rate
Any help will be really appreciated. Thanks a lot for your help.
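For reference, here is a minimal sketch of the procedure the paper describes (band-pass 0.5-2.5 Hz, normalize, zero-lag cross-correlation). It assumes the Signal Processing Toolbox, a sampling rate fs, and the Data matrix returned by readData with one channel per row; the filter order and the use of xcorr's 'coeff' scaling, which bounds the result to [-1, 1], are my assumptions rather than details from the paper:
fs = 10;                                   % sampling rate in Hz (assumption)
fc = [0.5 2.5];                            % cardiac band from the paper
[b, a] = butter(3, fc/(fs/2), 'bandpass'); % 3rd-order Butterworth band-pass
x = filtfilt(b, a, Data(1,:));             % zero-phase filtering, channel 1
y = filtfilt(b, a, Data(5,:));             % zero-phase filtering, channel 5
x = (x - mean(x)) / std(x);                % normalize amplitudes
y = (y - mean(y)) / std(y);
c0 = xcorr(x, y, 0, 'coeff');              % zero-lag value, bounded to [-1, 1] by 'coeff'
Without a scaling option, xcorr returns a raw sum of products, so its magnitude depends on the signal amplitude and length rather than being confined to the -1 to 1 range.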

Related

Modeling an HRF time series in MATLAB

I'm attempting to model fMRI data so I can check the efficacy of an experimental design. I have been following a couple of tutorials and have a question.
I first need to model the BOLD response by convolving a stimulus input time series with a canonical haemodynamic response function (HRF). The first tutorial I checked said that one can make an HRF of any amplitude as long as the 'shape' of the HRF is correct, so they created the following HRF in MATLAB:
hrf = [ 0 0 1 5 8 9.2 9 7 4 2 0 -1 -1 -0.8 -0.7 -0.5 -0.3 -0.1 0 ]
And then convolved the HRF with the stimulus by just using 'conv' so:
hrf_convolved_with_stim_time_series = conv(input,hrf);
This is very straightforward, but I want my model to eventually be as accurate as possible, so I checked a more advanced tutorial, and they did the following. First they created a vector of 20 timepoints, then used the 'gampdf' function to create the HRF.
t = 1:1:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL
h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1
Is there a benefit to doing it this way over the simpler one? I suppose I have 3 specific questions.
The 'gampdf' help page is super short and only says that the '6' and '10' in each function call represent 'A', which is a 'shape' parameter. What does this mean? It gives no other information. Why is it 6 in the first call and 10 in the second?
This question is directly related to the above one. This code is written for a situation where there is a TR = 1 and the stimulus is very short (like 1s). In my situation my TR = 2 and my stimulus is quite long (12s). I tried to adapt the above code to make a working HRF for my situation by doing the following:
t = 1:2:40; % 2s timestep with the 40 to try to equate total time to above
h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL
h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1
Because I have no idea what the 'gampdf' parameters mean (or what that line does, in all actuality) I'm not sure this gives me what I'm looking for. I essentially get out 20 values where 1-14 have SOME numeric value in them but 15-20 are all 0. I'm assuming there will be a response during the entire 12s stimulus period (first 6 TRs so values 1-6) with the appropriate rectification which could be the rest of the values but I'm not sure.
Final question. The other code does not 'scale' the HRF to have an amplitude of 1. Will that matter, ultimately?
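For what it's worth, gampdf(t, A) is the gamma probability density with shape A and unit scale, so the shape parameter mainly controls where the curve peaks (the positive lobe with shape 6 peaks earlier than the undershoot with shape 10). Below is a minimal sketch, not taken from either tutorial, of how that HRF could be sampled at a 2 s TR and convolved with a 12 s boxcar stimulus; the run length, stimulus onset, and undershoot weight are illustrative assumptions:
TR = 2;                                   % repetition time in seconds
t  = 0:TR:30;                             % 30 s of HRF support, sampled every TR
h  = gampdf(t, 6) - 0.5*gampdf(t, 10);    % double-gamma HRF: peak minus weighted undershoot
h  = h / max(h);                          % scale peak to 1 (amplitude is arbitrary for design checks)
runLength = 120;                          % total run length in seconds (illustrative)
stim = zeros(1, runLength/TR);            % one entry per TR
stim(round(10/TR)+1 : round(22/TR)) = 1;  % stimulus 'on' from 10 s to 22 s (a 12 s block)
bold = conv(stim, h);                     % predicted BOLD time series
bold = bold(1:numel(stim));               % trim the convolution tail to the run length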
The canonical HRF you choose is dependent upon where in the brain the BOLD signal is coming from. It would be inappropriate to choose just any HRF. Your best source of a model is going to come from a lit review. I've linked a paper discussing the merits of multiple HRF models. The methods section brings up some salient points.

MATLAB Simple - Linear Predictive Coding and Energy Forecasting

I have a dataset with 274 samples (9 months) of the daily energy (Watt-hours) used in a residential household. I'm not sure if I'm applying the lpc function correctly.
My code is the following:
filename='9-months.csv';
energy = csvread(filename);
C=zeros(5,1);
counter=0;
N=3;
for n=274:-1:31
    w2=energy(1:n-1,1);
    a=lpc(w2,N);
    energy_estimated=0;
    for X = 1:N
        energy_estimated = energy_estimated + (-a(X+1)*energy(n-X));
    end
    w_real=energy(n);
    error2=abs(w_real-energy_estimated);
    counter=counter+1;
    C(counter,1)=error2;
end
mean_error=round(mean(C));
Being "n" the sample on analysis, I will use the energy array's values, from 1 to n-1, to calculate the lpc coefficientes (with N=3).
After that, it will apply the calculated coefficients on the "for" cycle presented, in order to calculate the estimated energy.
Finally, error2 outputs the error between the real energy and estimated value.
On the example presented ( http://www.mathworks.com/help/signal/ref/lpc.html ) some filters are used. Do I need to apply any filter to it? Is my methodology correct?
Thank you very much in advance!
The lpc function seems to be used correctly, but there are a few other things about your code. I am addressing the part starting at "for n":
for n=31:274 % for me it would seem more logical to go forward in time
    w2=energy(1:n-1,1);
    a=lpc(w2,N);
    energy_estimate=filter([0 -a(2:end)],1,energy(1:n,1)); % one-step LPC prediction for every sample up to n
    energy_estimate=energy_estimate(end);                  % keep only the prediction of sample n
    estimates(n)=energy_estimate;
end
error=energy(31:274)-estimates(31:274)';
meanerror=mean(error); % you don't really round mean errors
filter does exactly what you are trying to do with the X=1:N loop, but it performs the calculation for the entire input vector. If you just want the prediction for sample n, index the output with (end) as shown.
There is also no reason to calculate the error for every single value and then append it to a vector; you can do that faster in one step after the loop.
If you are trying to estimate future values with an LPC it could work like that, but you are implying that every value depends only on the last 3 values. Have you tried something like a polynomial approach (see the sketch below)? I would think that this would be closer to reality.
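For illustration, here is a minimal sketch of that polynomial idea: fit a low-order trend to a window of past samples with polyfit and extrapolate it one step ahead with polyval. The window length, polynomial degree, and target sample index are illustrative assumptions, not part of the original answer.
deg = 2;                         % polynomial degree (assumption)
win = 30;                        % number of past samples used for the fit (assumption)
n   = 100;                       % sample we want to predict (example index)
idx = (n-win):(n-1);             % indices of the fitting window
p   = polyfit((idx - n)', energy(idx), deg); % fit a trend, centred on the target sample
energy_estimate = polyval(p, 0); % evaluate the fitted trend at the target sample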

MSE in neuralnet results and roc curve of the results

Hi, my question is a bit long; please bear with me and read it till the end.
I am working on a project with 30 participants. We have two types of data set (the first data set has 30 rows and 160 columns; the second has the same 30 rows and 200 columns as outputs = y, and these outputs are independent). What I want to do is use the first data set to predict the outputs of the second data set. As the first data set was rectangular and high-dimensional, I have used factor analysis and now have 19 factors that cover up to 98% of the variance. Now I want to use these 19 factors to predict the outputs of the second data set.
I am using neuralnet with backpropagation, and everything goes well; my results are really close to the outputs.
My questions:
1- As my inputs are the factors (they are between -1 and 1) and my outputs are integers between 4 and 10000, should I still scale them before running the neural network?
2- I scaled the data (both inputs and outputs) and then predicted with neuralnet; when I checked the MSE it was very high, around 6000, even though my predictions and the real outputs are very close to each other. But if I rescale the predictions and outputs back and then check the MSE, it is near zero. Is it unbiased to rescale and then check the MSE?
3- I read that it is better not to scale the outputs from the beginning, but if I only scale the inputs, all my predictions are 1. Is it correct not to scale the outputs?
4- If I want to plot the ROC curve, how can I do it, given that my results are never exactly equal to the real outputs?
Thank you for reading my question
[edit#1]: There is a publication on how to produce ROC curves using neural network results
http://www.lcc.uma.es/~jja/recidiva/048.pdf
1) You can scale your values (using min-max, for example), but only fit the scaling on your training data set. Save the parameters used in the scaling process (for min-max they would be the min and max values by which the data is scaled). Only then can you scale your test data set WITH the min and max values you got from the training data set. Remember, with the test data set you are trying to mimic the process of classifying unseen data, and unseen data is scaled with the scaling parameters obtained from the training data set.
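For instance, a minimal MATLAB-style sketch of that idea with min-max scaling; Xtrain and Xtest are illustrative names for matrices with one observation per row:
trainMin = min(Xtrain, [], 1);   % per-feature minimum, computed on training data only
trainMax = max(Xtrain, [], 1);   % per-feature maximum, computed on training data only
XtrainScaled = (Xtrain - trainMin) ./ (trainMax - trainMin);
XtestScaled  = (Xtest  - trainMin) ./ (trainMax - trainMin);   % reuse the TRAINING parameters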
2) When talking about errors, do mention which data set the error was computed on. You can compute an error function (there are several; one of them is the mean squared error, or MSE) on the training data set, and another on your test data set.
4) Think about this: let's say you train a network with the training data set, and it has only 1 neuron in the output layer. Then you present it with the test data set. Depending on which transfer function (activation function) you use in the output layer, you will get a value for each exemplar. Let's assume you use a sigmoid transfer function, whose min and max values are 0 and 1. That means the predictions will be limited to values between 0 and 1.
Let's also say that your target labels ("truth") only contains discrete values of 0 and 1 (indicating which class the exemplar belongs to).
targetLabels=[0 1 0 0 0 1 0 ];
NNprediction=[0.2 0.8 0.1 0.3 0.4 0.7 0.2];
How do you interpret this?
You can apply a hard-limiting function so that the NNprediction vector only contains the discrete values 0 and 1. Let's say you use a threshold of 0.5:
NNprediction_thresh_0.5 = [0 1 0 0 0 1 0];
vs.
targetLabels =[0 1 0 0 0 1 0];
With this information you can compute your FP, FN, TP, and TN counts (and a bunch of additional derived metrics, such as True Positive Rate = TP/(TP+FN)).
If you had a ROC curve showing the False Positive Rate vs. the True Positive Rate, this would be a single point in the plot. However, if you vary the threshold in the hard-limit function, you get all the values you need for a complete curve.
Makes sense? See the dependencies of one process on the others?
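To make the threshold sweep concrete, here is a minimal MATLAB sketch using the example vectors above; the threshold grid is an arbitrary choice:
targetLabels = [0 1 0 0 0 1 0];
NNprediction = [0.2 0.8 0.1 0.3 0.4 0.7 0.2];
thresholds = 0:0.05:1;                            % sweep the hard-limit threshold
TPR = zeros(size(thresholds));
FPR = zeros(size(thresholds));
for k = 1:numel(thresholds)
    predicted = NNprediction >= thresholds(k);    % hard-limit at this threshold
    TP = sum(predicted == 1 & targetLabels == 1);
    FP = sum(predicted == 1 & targetLabels == 0);
    FN = sum(predicted == 0 & targetLabels == 1);
    TN = sum(predicted == 0 & targetLabels == 0);
    TPR(k) = TP / (TP + FN);                      % true positive rate
    FPR(k) = FP / (FP + TN);                      % false positive rate
end
plot(FPR, TPR, '-o'); xlabel('False Positive Rate'); ylabel('True Positive Rate');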

Trying to produce exponential traffic

I'm trying to simulate an optical network algorithm in MATLAB for a homework project. Most of it is already done, but I have an issue with the diagrams I'm getting.
In the simulation I'm generating exponential traffic; however, for low lambda values (0.1) I'm getting very high packet drop rates (99%). I wrote a sample here which is very close to the testbench I'm running on my simulator.
% Run the simulation 10 times, with different lambda values
l = [1 2 3 4 5 6 7 8 9 10];
for i=l(1):l(end)
    X = rand();
    % In the 'real' simulation the following line defines the time
    % when the next packet generation event will occur. Suppose that
    % i is the current time
    t_poiss = i + ceil((-log(X)/(i/10)));
    distr(i)=t_poiss;
end
figure, plot(distr)
axis square
grid on;
title('Exponential test:')
The plot this sample produces (not reproduced here) is IDENTICAL to the diagram I'm getting for the drop rate/λ. So I would like to ask: am I doing something wrong, or am I missing something? Is this the right thing to expect?
The problem might be a numerical one. Since you are generating a random number for X, the number can be incredibly small, say, close to zero. If X is numerically close to zero, log(X) is going to be HUGE in magnitude, so your calculation of t_poiss will be huge. I would suggest doing something like X = rand() + 1 to make sure that X is never close to zero.
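For reference, the usual way to draw exponential inter-arrival times for a fixed rate lambda is inverse-transform sampling; here is a minimal sketch where the lambda value and packet count are illustrative:
lambda = 0.1;                        % arrival rate (illustrative)
nPackets = 1000;
U = rand(1, nPackets);               % uniform samples in (0,1)
interArrival = -log(U) / lambda;     % inverse-transform sampling of Exp(lambda)
arrivalTimes = cumsum(interArrival); % absolute packet generation times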

Match Two Sets of Measurement Data With Different Logging Start Times and End Times

Problem
I have two arrays (Xa and Xb) that contain measurements of the same physical signal, but they are taken at different sample rates. Also, physical logging of the Xa data starts at a different time than that of Xb, and the logging stops at a different time as well.
i.e.
(The following is just a summary of important statements, not code.)
sampleRatea > sampleRateb % Resolution of Xa is greater than that of Xb
t0a ~= t0b % Start times are not equal
t1a ~= t1b % End times are not equal
Objective
Find the necessary shift in indices that will best line up these sets of data.
Approach
Use fmincon to find the index that minimizes the mean squared error (MSE) between versions Xa and Xb that are edited to have the same sample rate (perhaps using the interpolation function).
I have tried to do this, but it always seems that I have too many degrees of freedom. Can anyone shed some light on an approach that might facilitate this?
Assuming you have two samples with constant frequencies, the problem reduces to something quite simple:
Find scale, location such that:
Xa, at timestamps corresponding to its index, makes the best match with Xb at timestamps corresponding to location + scale * its index.
If you agree with this you can see that only two degrees of freedom are left, if you know the ratio of sample rates it even reduces to just 1 degree of freedom.
I believe that now the hard part is done, but some work still remains:
Judge how good two samples with timestamps and values match
Find the optimal combination of your location and scale parameter
Note that, assuming you complete these 2 steps properly, the solution should be optimal in terms of timestamps. As you are looking for a shift in (integer) indices, translating these timestamps back to indices may not result in the real optimum, but it should be pretty close.
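To make those two remaining steps concrete, here is a minimal sketch of the single-degree-of-freedom search, assuming the sample-rate ratio is known so that only the time offset has to be found; apart from Xa, Xb, sampleRatea, and sampleRateb, the variable names and the offset grid are illustrative assumptions:
ta = (0:numel(Xa)-1) / sampleRatea;        % timestamps of Xa, starting at 0
tb = (0:numel(Xb)-1) / sampleRateb;        % timestamps of Xb, starting at 0
offsets = -5:0.01:5;                       % candidate start-time differences in seconds
mse = zeros(size(offsets));
for k = 1:numel(offsets)
    XbOnTa = interp1(tb + offsets(k), Xb, ta);    % shift Xb in time and resample onto ta
    valid  = ~isnan(XbOnTa);                      % ignore the non-overlapping part
    mse(k) = mean((Xa(valid) - XbOnTa(valid)).^2);
end
[~, best] = min(mse);
time_shift = offsets(best);                % estimated offset between the two recordings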
Here is a quick-and-dirty solution that should be enough to get you started. Given your input signals Xa and Xb sampled at sampleRatea and sampleRateb respectively:
g = gcd(sampleRatea,sampleRateb);
Ya = interp(Xa,sampleRateb/g);           % upsample both signals to a common rate
Yb = interp(Xb,sampleRatea/g);
Yfs = sampleRatea*sampleRateb/g;         % the common sample rate after interpolation
[acor,lag] = xcorr(Ya,Yb);
time_shift = lag(acor == max(acor))/Yfs; % lag of the correlation peak, converted to seconds
The variable time_shift will tell you the time elapsed between the start of A and the start of B. If B starts first, the result will be negative.
If your sampling rates are relatively prime, this will be horribly inefficient. If one is an integer multiple of the other, or they have a relatively large GCD, it will be much better.