I have an EMG signal of a subject walking on a treadmill.
We used footswitches to be able to see when the subject is placing his foot, so we can see how many periods (steps) there are in time.
We would like to cut the signal in periods (steps) and get one average signal (100% step cycle).
I tried the reshape function but it does not work
when I count 38 steps:
nwaves = 38;
sig2 = reshape(sig,[numel(sig)/nwaves nwaves])';
avgSig = mean(sig2,1);
plot(avgSig);
the error displayed is this: Size arguments must be real integers.
Can anyone help me with this? Thanks!
First of all, reshaping the array is a bad approach to the problem. In real world one cannot assume that the person on the treadmill will step rhythmically with millisecond-precision (i.e. for the same amount of samples).
A more realistic approach is to use the footswitch signal: assume is really a switch on a single foot (1=foot on, 0=foot off), and its actions are filtered to avoid noise (Schmidt trigger, for example), you can get the samples index when the foot is removed from the treadmill with:
foot_off = find(diff(footswitch) < 0);
then you can transform your signal in a cell array (variable lengths) of vectors of data between consecutive steps:
step_len = diff([0, foot_off, numel(footswitch)]);
sig2 = mat2cell(sig(:), step_len, 1);
The problem now is you can't apply mean() to the signal slices in order to get an "average step": you must process each step first, then average the results.
It's probably because numel(sig)/nwaves isn't an integer. You need to round it to the nearest integer with round(numel(sig)/nwaves).
EDIT based on comments:
Your problem is you can't divide 51116 by 38 (it's 1345.2), so you can't reshape your signal in chunks of 38 long. You need a signal whose length is exactly a multiple of 38 if you want to be able to reshape it in chunks of 38. Either that, or remove the last (or first) 6 values from your signal to have an exact multiple of 38 (1345 * 38 = 51110):
nwaves = 38;
n_chunks = round(numel(sig)/nwaves);
max_sig_length = n_chunks * nwaves;
sig2 = reshape(sig(1:max_sig_length),[n_chunks nwaves])';
avgSig = mean(sig2,1);
plot(avgSig);
Related
I am using MATLAB to find the number of peaks of a signal.
I'm trying to plot the number of peaks of a signal filtered with N-point moving average filter, N goes from 2 to 30.(I also consider the number of peaks when no filter has applied at the beginning of the resulting array) My data array(imported from csv and has double values between 0 and 1) has around 50k points. When I give part of the data i.e 100, 500 or 1000 points, using array slicing, # of peaks decrease as expected. However, when I give the whole data or even 2000 points, the number of peaks stays same at 127.
I changed the number of data given to the filter to find out why this happens. I changed the commented lines like showed in the comment and tried. When less than 1000 data points given plot was fine.
Here is the signal
https://www.dropbox.com/s/e1bkcjn5ta5q610/exampleSignal.csv?dl=0
Please import it from 4th element to end, it has some strange data at the beginning, I have not taken them, VarName1 is the imported column vector's name
numberOfPeaks = zeros(30,1,'int8');
pks = findpeaks(VarName1); % VarName1(1:1000,:) (when no filter applied)
numberOfPeaks(1) = size(pks,1);
for i=2:30
h = 1/i*ones(1,i,'double');
y = filter(h,1,VarName1); % VarName1(1:1000,:)
numberOfPeaks(i) = size(findpeaks(y),1);
end
plot(1:30,numberOfPeaks);
I expect a plot like this when whole the data is given:
but I get:
I realised that the problem is int8 I use. It can only take up to 127 and this caused my big results to be as 127.
Turning it into double solves the problem.
I am trying to find the Mean of three cycles after the signal become periodic and reach to steady state. I have a signal that is not periodic at the beginning but after some time it became periodic. I want to find the Mean of the next three cycles which each cycle has five points.
Now I did that by opening the plot and find the point where the signal become periodic then I enter that point to MATLAB, then I got the results. The program working fine but I have a big problem. I have 500,000 data records and its impossible to open each one and find the starting point where the signal become periodic. Is there any way that I can find starting point without opening the plot because each case has a different starting point where the signal become periodic?
I used below code now
close all,clear variables,clear all;
clc;
prompt = 'Enter Strating Point?';
N= input(prompt);
Result=mean(mean(1,N:N+4)+mean(1,N+5:N+9)+mean(1,N+10:N+14));
I attached sample of data, Column one is the signal and column two is the time.
https://www.dropbox.com/sh/27lebrp1lwnmm3l/AABIhN1tzUSJQjjED954Yvyka?dl=0
Thank you!
Full edit:
%inputs: time and y (the response), both same length vectors
ppc = 5; % points per cycle
A = zeros(ppc,1);
for i = 1:ppc
A(i) = mean(y(i:ppc:length(y)));
end
[~,b] = min(A);
possidx = (length(time)+b-ppc):-ppc:b; %idx of lowest points
lowlist = fliplr(y(possidx));% lowest points
for i = 2:length(lowlist) %start from behind
se = std(lowlist(1:i))/sqrt(i); %calculate SE for all current points
if se > 0.05 %depending on your filed you might wanna change it to a lower value
periodstart = time(possidx(i-1)); %lowest point of first period
break
end
end
What it does: the first loop finds which group of points is always at the bottom. So adjust ppc to 10 if you have 10 points per cycle. The points per cycle don't have to be exactly the same for each cycle if you have a lot of them, it should still be reasonably accurate.
Then we add from behind one by one these lowest points and calculate the standard error. Once it is greater than 0.05 we are outside of the periods.
I felt so free to use standard error because that is something i know and that makes sense in this situation. I set the threshold to 0.05 because it's standard in many fields, alter it if it is different in your field.
I have this piece of code
N=10^4;
for i = 1:N
[E,X,T] = fffun(); % Stochastic simulation. Returns every time three different vectors (whose length is 10^3).
X_(i,:)=X;
T_(i,:)=T;
GRID=[GRID T];
end
GRID=unique(GRID);
% Second part
for i=1:N
for j=1:(kmax)
f=find(GRID==T_(i,j) | GRID==T_(i,j+1));
s=f(1);
e=f(2)-1;
counter(X_(i,j), s:e)=counter(X_(i,j), s:e)+1;
end
end
The code performs N different simulations of a stochastic process (which consists of 10^3 events, occurring at discrete moments (T vector) that depends on the specific simulation.
Now (second part) I want to know, as a function of time istant, how many simulations are in a particular state (X assumes value between 1 and 10). The idea I had: create a grid vector with all the moments at which something happens in any simulation. Then, looping over the simulations, loop over the timesteps in which something happens and incrementing all the counter indeces that corresponds to this particular slice of time.
However this second part is very heavy (I mean days of processing on a standard quad-core CPU). And it shouldn't.
Are there any ideas (maybe about comparing vectors in a more efficient way) to cut the CPU time?
This is a standalone 'second_part'
N=5000;
counter=zeros(11,length(GRID));
for i=1:N
disp(['Counting sim #' num2str(i)]);
for j=1:(kmax)
f=find(GRID==T_(i,j) | GRID==T_(i,j+1),2);
s=f(1);
e=f(2)-1;
counter(X_(i,j), s:e)=counter(X_(i,j), s:e)+1;
end
end
counter=counter/N;
stop=find(GRID==Tmin);
stop=stop-1;
plot(counter(:,(stop-500):stop)')
with associated dummy data ( filedropper.com/data_38 ). In the real context the matrix has 2x rows and 10x columns.
Here is what I understand:
T_ is a matrix of time steps from N simulations.
X_ is a matrix of simulation state at T_ in those simulations.
so if you do:
[ut,~,ic]= unique(T_(:));
you get ic which is a vector of indices for all unique elements in T_. Then you can write:
counter = accumarray([ic X_(:)],1);
and get counter with no. of rows as your unique timesteps, and no. of columns as the unique states in X_ (which are all, and must be, integers). Now you can say that for each timestep ut(k) the number of time that the simulation was in state m is counter(k,m).
In your data, the only combination of m and k that has a value greater than 1 is (1,1).
Edit:
From the comments below, I understand that you record all state changes, and the time steps when they occur. Then every time a simulation change a state you want to collect all the states from all simulations and count how many states are from each type.
The main problem here is that your time is continuous, so basically each element in T_ is unique, and you have over a million time steps to loop over. Fully vectorizing such a process will need about 80GB of memory which will probably stuck your computer.
So I looked for a combination of vectorizing and looping through the time steps. We start by finding all unique intervals, and preallocating counter:
ut = unique(T_(:));
stt = 11; % no. of states
counter = zeros(stt,numel(ut));r = 1:size(T_,1);
r = 1:size(T_,1); % we will need that also later
Then we loop over all element in ut, and each time look for the relevant timestep in T_ in all simulations in a vectorized way. And finally we use histcounts to count all the states:
for k = 1:numel(ut)
temp = T_<=ut(k); % mark all time steps before ut(k)
s = cumsum(temp,2); % count the columns
col_ind = s(:,end); % fins the column index for each simulation
% convert the coulmns to linear indices:
linind = sub2ind(size(T_),r,col_ind.');
% count the states:
counter(:,k) = histcounts(X_(linind),1:stt+1);
end
This takes about 4 seconds at my computer for 1000 simulations, so it adds to a little more than one hour for the whole process. Not very quick...
You can try also one or two of the tweaks below to squeeze run time a little bit more:
As you can read here, accumarray seems to work faster in small arrays then histcouns. So may want to switch to it.
Also, computing linear indices directly is a quicker method than sub2ind, so you may want to try that.
implementing these suggestions in the loop above, we get:
R = size(T_,1);
r = (1:R).';
for k = 1:K
temp = T_<=ut(k); % mark all time steps before ut(k)
s = cumsum(temp,2); % count the columns
col_ind = s(:,end); % fins the column index for each simulation
% convert the coulmns to linear indices:
linind = R*(col_ind-1)+r;
% count the states:
counter(:,k) = accumarray(X_(linind),1,[stt 1]);
end
In my computer switching to accumarray and or removing sub2ind gain a slight improvement but it was not consistent (using timeit for testing on 100 or 1K elements in ut), so you better test it yourself. However, this still remains very long.
One thing that you may want to consider is trying to discretize your timesteps, so you will have much less unique elements to loop over. In your data about 8% of the time intervals a smaller than 1. If you can assume that this is short enough to be treated as one time step, then you could round your T_ and get only ~12.5K unique elements, which take about a minute to loop over. You can do the same for 0.1 intervals (which are less than 1% of the time intervals), and get 122K elements to loop over, what will take about 8 hours...
Of course, all the timing above are rough estimates using the same algorithm. If you do choose to round the times there may be even better ways to solve this.
I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for reason: eg. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series with NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things could get (though including it could lead to overestimate the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: eg. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to be deal with that.
Another general finance / panel data point, selecting a sample at some time point t and then following it into the future is perfectly ok. But selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems though if whether an observation is missing is not random and orthogonal to what you care about.
Inputs are a T by n sized matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % 'initialize
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i=T_subsample + 1:T
% One pass: add new period in the window and subtract period no longer in the window
nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
j = j + 1;
end
indicator = nancount==0; % indicator of whether starting_period, series
% has no NaNs
% number of nonan series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
Here's an idea,
Assuming you can rearrange the series, calculate the distance (you decide the metric, but if looking at is nan vs not is nan, Hamming is ok).
Now hierarchically cluster the series and rearrange them using either a dendrogram
or http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non nan values before you start.
First I have only little insight in financial mathematics. I understood it that you want to find the longest continuous chain of non-NaN values for each time series. The time series should be sorted depending on the length of this chain and each time series, not containing a chain above a threshold, discarded. This can be done using
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(#max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all elements that only have block sizes smaller then some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot simpler, if the order of periods within a time series doesn't matter, aka data([1 2 3],id) and data([3 1 2], id) are identical.
What I do not know is, if you want to discard all periods within a time series that don't correspond to the biggest value, get all those chains as individual time series, ...
Feel free to drop a comment if it has to be more specific.
I am recording voltage changes over a small circuit- this records mouse feeding. When the mouse is eating, the circuit voltage changes, I convert that into ones and zeroes, all is well.
BUT- I want to calculate the number and duration of 'bursts' of feeding- that is, instances of circuit closing that occur within 250 ms (75 samples) of one another. If the gap between closings is larger than 250ms I want to count it as a new 'burst'
I guess I am looking for help in asking matlab to compare the sample number of each 1 in the digital file with the sample number of the next 1 down- if the difference is more than 75, call the first 1 the end of one bout and the second one the start of another bout, classifying the difference as a gap, but if it is NOT, keep the sample number of the first 1 and compare it against the next and next and next until there is a 75-sample difference
I can compare each 1 to the next 1 down:
n=1; m=2;
for i = 1:length(bouts4)-1
if bouts4(i+1) - bouts4(i) >= 75 %250 msec gap at a sample rate of 300
boutend4(n) = bouts4(i);
boutstart4(m)= bouts4(i+1);
m = m+1;
n = n+1;
end
I don't really want to iterate through i for both variables though...
any ideas??
-DB
You can try the following code
time_diff = diff(bouts4);
new_feeding = time_diff > 75;
boutend4 = bouts4(new_feeding);
boutstart4 = [0; bouts4(find(new_feeding) + 1)];
That's actually not too bad. We can actually make this completely vectorized. First, let's start with two signals:
A version of your voltages untouched
A version of your voltages that is shifted in time by 1 step (i.e. it starts at time index = 2).
Now the basic algorithm is really:
Go through each element and see if the difference is above a threshold (in your case 75).
Enumerate the locations of each one in separate arrays
Now onto the code!
%// Make those signals
bout4a = bouts4(1:end-1);
bout4b = bouts4(2:end);
%// Ensure column vectors - you'll see why soon
bout4a = bout4a(:);
bout4b = bout4b(:);
% // Step #1
loc = find(bouts4b - bouts4a >= 75);
% // Step #2
boutend4 = [bouts4(loc); 0];
boutstart4 = [0; bouts4(loc + 1)];
Aside:
Thanks to tail.b.lo, you can also use diff. It basically performs that difference operation with the copying of those vectors like I did before. diff basically works the same way. However, I decided not to use it so you can see how exactly your code that you wrote translates over in a vectorized way. Only way to learn, right?
Back to it!
Let's step through this slowly. The first two lines of code make those signals I was talking about. An original one (up to length(bouts) - 1) and another one that is the same length but shifted over by one time index. Next, we use find to find those time slots where the time index was >= 75. After, we use these locations to access the bouts array. The ending array accesses the original array while the starting array accesses the same locations but moved over by one time index.
The reason why we need to make these two signals column vector is the way I am appending information to the starting vector. I am not sure whether your data comes in rows or columns, so to make this completely independent of orientation, I'm going to make sure that your data is in columns. This is because if I try to append a 0, if I do it to a row vector I have to use a space to denote that I'm going to the next column. If I do it for a column vector, I have to use a semi-colon to go to the next row. To completely avoid checking to see whether it's a row or column vector, I'm going to make sure that it's a column vector no matter what.
By looking at your code m=2. This means that when you start writing into this array, the first location is 0. As such, I've artificially placed a 0 at the beginning of this array and followed that up with the rest of the values.
Hope this helps!