Extracting the last (non-NaN) 200 columns from a matrix with varying numbers of NaNs ending the rows - MATLAB

I have pupil size data from an eye-tracking experiment. In the experiment, to start each trial the participant looks at the centre of the screen for 1000 ms (the prefixation period), then the trial begins. If they look away or blink, the 1000 ms period restarts, so each trial has a different-length prefixation period. When I create a matrix (each row a different trial, each column a pupil size sample over time), the number of columns is set by the trial with the longest prefixation period, and every other row is padded at the end with a varying number of NaNs. I need to extract the last 200 samples (columns) of each trial, but these are not the last 200 columns of the matrix because of the added NaNs.
At the moment I have this:
Row1 = PreFixBase(1,:); % extract the first row
Row1(isnan(Row1)) = []; % get rid of the NaNs
Row1Base = Row1(end-200+1:end); % extract the last 200 samples / columns
which I do for each row separately and then paste the rows back together. It works, but it is really inefficient (I have 324 rows / trials), and I'm sure there must be a more concise way of doing this, but I haven't been able to find the answer.
Any help appreciated.
Amy

You can use cumsum plus logical indexing to extract the desired elements:
Base = PreFixBase.'; % transpose the matrix so each trial is a column
S = cumsum(~isnan(Base),1,'reverse'); % count non-NaN samples from the end: the last non-NaN gets 1, the one before it 2, and so on
Result = reshape(Base(S > 0 & S <= 200),200,[]).'; % keep the entries numbered 1..200 and reshape back to one row per trial
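As a quick sanity check, here is a toy version of the same idea (3 trials, keeping the last 2 samples instead of 200; note that the 'reverse' option of cumsum requires R2014b or newer):
PreFixBase = [1 2 3 NaN; 4 5 NaN NaN; 6 7 8 9]; % toy data: 3 trials as rows
Base = PreFixBase.';
S = cumsum(~isnan(Base),1,'reverse');
Result = reshape(Base(S > 0 & S <= 2),2,[]).'
% Result is [2 3; 4 5; 8 9] -- the last 2 non-NaN samples of each trial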

Related

Why does the number of peaks of my signal stay the same when I increase n in an n-point moving average filter and the data is big?

I am using MATLAB to find the number of peaks of a signal.
I'm trying to plot the number of peaks of a signal filtered with an N-point moving average filter, where N goes from 2 to 30. (I also record the number of peaks with no filter applied at the beginning of the resulting array.) My data array (imported from CSV, with double values between 0 and 1) has around 50k points. When I give the filter part of the data, i.e. 100, 500 or 1000 points via array slicing, the number of peaks decreases as expected. However, when I give it the whole data set, or even 2000 points, the number of peaks stays the same at 127.
To find out why this happens, I varied the number of data points given to the filter by editing the commented lines as shown in the code below. When fewer than 1000 data points were given, the plot was fine.
Here is the signal
https://www.dropbox.com/s/e1bkcjn5ta5q610/exampleSignal.csv?dl=0
Please import it from the 4th element to the end; it has some strange data at the beginning which I have not used. VarName1 is the name of the imported column vector.
numberOfPeaks = zeros(30,1,'int8');
pks = findpeaks(VarName1); % VarName1(1:1000,:) (when no filter applied)
numberOfPeaks(1) = size(pks,1);
for i = 2:30
    h = 1/i*ones(1,i,'double');
    y = filter(h,1,VarName1); % VarName1(1:1000,:)
    numberOfPeaks(i) = size(findpeaks(y),1);
end
plot(1:30,numberOfPeaks);
I expect a plot where the number of peaks keeps decreasing when the whole data set is given, but instead I get one that flattens out at 127. (The expected and actual plots are not reproduced here.)
I realised that the problem is the int8 type I use: it can only hold values up to 127, which caused all my larger peak counts to be reported as 127.
Changing it to double solves the problem.
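A minimal sketch of the fix (only the preallocation changes; everything else stays the same):
numberOfPeaks = zeros(30,1); % double by default -- int8 saturates at 127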

How to identify an optimal subsample from a data set with missing values in MATLAB

I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs because some of the financial data are not available for the entire period. In the illustration (not reproduced here), NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would ideally like to combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series containing NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to underestimating how haywire things can get (though including it could lead to overestimating the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: e.g. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK, but selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems above, though, if whether an observation is missing is not random and orthogonal to what you care about.
Inputs are a T by n sized matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize
% We need to obtain a count of NaNs in a T_subsample sized window for each
% possible starting period
j = 1;
for i = T_subsample+1:T
    % One pass: add the period entering the window and subtract the period leaving it
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end
indicator = nancount==0; % whether (series, starting period) has no NaNs
% number of NaN-free series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); % holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
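As an aside (an alternative not in the original answer, assuming MATLAB R2016a or newer for movsum), the same sliding-window NaN count can be obtained without the loop:
nancount_alt = movsum(double(isnan(X)), T_subsample, 1, 'Endpoints', 'discard').'; % n by nancount_periods
isequal(nancount, nancount_alt) % should return 1 (logical true)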
Here's an idea,
Assuming you can rearrange the series, calculate the pairwise distances between them (you decide the metric, but if you are only comparing NaN vs non-NaN patterns, Hamming distance is fine).
Now hierarchically cluster the series and rearrange them, using either a dendrogram
or http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
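A minimal sketch of that idea (assuming the Statistics and Machine Learning Toolbox for pdist, linkage and dendrogram, and data holding one series per column):
pattern = double(isnan(data)); % T-by-n indicator of missing entries
D = pdist(pattern.', 'hamming'); % pairwise Hamming distances between series
Z = linkage(D, 'average'); % hierarchical clustering
[~, ~, order] = dendrogram(Z, 0); % leaf order (0 = show all leaves)
data_clustered = data(:, order); % series with similar NaN patterns end up adjacent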
First, I have only a little insight into financial mathematics. As I understand it, you want to find the longest continuous chain of non-NaN values for each time series; the time series should then be sorted by the length of this chain, and each time series not containing a chain above a threshold discarded. This can be done using
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all elements that only have block sizes smaller then some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot more simply if the order of periods within a time series doesn't matter, i.e. if data([1 2 3],id) and data([3 1 2],id) are considered identical.
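For completeness, a sketch of that simpler route: just count the non-NaN entries per series and sort by the count (reusing some_number from above as the threshold):
[count_sorted, idx] = sort(sum(~isnan(data)), 'descend'); % non-NaN count per column
data_sorted = data(:, idx);
data_sort_chop = data_sorted(:, count_sorted >= some_number);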
What I do not know is whether you want to discard all periods within a time series that don't belong to the biggest chunk, or extract all those chains as individual time series, ...
Feel free to drop a comment if it has to be more specific.

How do I efficiently multiply every 2 columns and sum each row

Maybe I should just go with a for loop but I want to see if there is a more efficient/faster way to do it.
I have a matrix of numbers, let's say 10x10. I want to multiply element (1,1) by (1,2), then (1,3) by (1,4), etc., and then sum those products for row 1. Then move to the next row and do the same thing. The end result would be a vector of length 10.
It is possible for this matrix to be 1000x1000 so I want it to be as fast as possible. Thanks!
I would use
v = sum(M(:,1:2:end-1).*M(:,2:2:end),2);
Here M(:,1:2:end-1).*M(:,2:2:end) does the multiplication: every element of an odd-numbered column of M is multiplied by its neighbour to the right. (This assumes an even number of columns; otherwise the process you described is ill-defined.) Then every row is added up by the sum command.
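As a quick check on a small matrix:
M = [1 2 3 4; 5 6 7 8];
v = sum(M(:,1:2:end-1).*M(:,2:2:end),2)
% v = [1*2 + 3*4; 5*6 + 7*8] = [14; 86]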
On my computer, doing this for a 1000 by 1000 matrix takes 0.04 seconds.

How can I deal with an empty matrix (0-by-any number) resulting from a simulation?

I have made a simulation, and the result of each run is a matrix from which I choose a certain row. So if the simulation runs 500 times, I'll have 500 matrices, and the rows I choose will (at the end of the simulation) form 500 rows [one row from the first matrix ... last row from the last matrix].
The problem is that sometimes a matrix does not contain the row I want, and the answer is, for example, an empty matrix: 0-by-6.
I want to ignore this answer.
Note: the row I choose does not necessarily exist in all matrices, so if run = 600, resulting in 600 matrices, the rows I can choose may number only 400, and the other 200 will be zero.
The simulation STOPS when the result is an empty matrix: 0-by-any number.
I use MATLAB.
You can use isempty to detect empty arrays, for example:
a = zeros(0,5)
a =
   Empty matrix: 0-by-5
isempty(a)
ans =
     1
For the case when the index exceeds the matrix dimensions, you can add a condition that tests the size of your matrix, specifically how many rows it has, using size(M,1).
So, all together, in your for loop you can code something like:
for n = 1:blah
    if ~isempty(M) % continue only if the matrix is non-empty
        if size(M,1) >= n % continue only if the index doesn't exceed the matrix dimensions
            ....
        end
    end
end
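Putting it together, a minimal sketch (the names results, a cell array holding the matrix from each run, and k, the row you want, are hypothetical, not from the question):
rows = []; % collects the chosen row from each valid run
for n = 1:numel(results)
    M = results{n}; % hypothetical: matrix produced by run n
    if ~isempty(M) && size(M,1) >= k % skip empty or too-short matrices
        rows(end+1,:) = M(k,:); % keep row k of this run (grows rows; fine for a sketch)
    end
end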

How should I perform this binning and averaging in MATLAB?

I am trying to perform a binning average. I am using the code:
Avg = mean(reshape(a,300,144,27));
AvgF = squeeze(Avg);
The last line gets rid of singleton dimensions.
So, as can be seen, I am averaging over 300 points at a time. It works fine except when the total number of points is not a multiple of 144*300.
Is there any way to make this binning average work even when the total number of points is not a multiple of 144*300?
EDIT: Sorry if my question sounded confusing. To clarify...
I have a file with 43200 rows and 27 columns. I am averaging by binning 300 rows at a time, which means in the end I am left with a matrix of size 144-by-27.
My code as I wrote it above works only when I have exactly 43200 rows. In some cases I have 43199, 43194, etc. The reshape function works when the total number of rows is a multiple of 300 (the bin size). Is there a way to make this binning average work when my total number of rows is not a multiple of 300?
I think I understand the problem better now...
If a is the data read from your file (of size N-by-27, where N is ideally 43,200), then I think you would want to do the following:
nRemove = rem(size(a,1),300); % find the number of points to remove
a = a(1:end-nRemove,:); % trim points to make an even multiple of 300
Avg = mean(reshape(a,300,[],27));
AvgF = squeeze(Avg);
This will remove points so that the number of rows in a is a multiple of 300, after which your reshape and average should work. Note that I use [] in the call to RESHAPE, which lets it figure out what the size of that dimension should be.
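Alternatively, if you would rather keep the leftover rows instead of trimming them, a sketch (assuming R2015a or newer for the 'omitnan' flag of mean) is to pad up to a multiple of 300 with NaNs and ignore them in the average:
nPad = mod(300 - rem(size(a,1),300), 300); % rows needed to reach a multiple of 300
a = [a; nan(nPad, size(a,2))]; % pad the tail with NaNs
AvgF = squeeze(mean(reshape(a,300,[],27), 1, 'omitnan')); % the last partial bin averages over fewer points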