I have a file with a huge amount of data. Places where there is no information about prices are marked as NaN. I would like to delete all rows, where there are such names and delete all columns where there are a lot of missing data (because I need then proportional matrix).
I also have another string (AssetList) where there is information about all tickers. If column will be deleted, it’s necessary to delete according ticker there.
I would much appreciate any help.
Data:
6,41 16,51 x x 69,78
6,22 16 x x 68,48
6,17 15,61 x x 69,46
x x x x x
x x x x x
x x x x x
5,83 15,14 x x 69,85
6,4 17,64 x x 71,03
6,07 16,04 x x 68,64
5,91 17,09 x x 68,92
6 18,19 x x 68,72
x x x x x
x x x x x
5,58 17,17 x x 69,02
5,3 16,83 x x 67,69
5,66 19,65 x x 68,64
5,65 20,86 x x 69,45
5,43 20,46 x x 68,94
x x x x x
x x x x x
5,58 2 0,16 x 68,73
AssetList:
FLWS SRCE FUBC DDD MMM
I'll have to make some assumptions here, as I didn't fully understand your question.
The following first deletes all rows that exist of NaN exclusively, and continues by deleting all columns that contain at least one NaN:
M = [ ...
6.41 16.51 NaN NaN 69.78
6.22 16 NaN NaN 68.48
6.17 15.61 NaN NaN 69.46
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
5.83 15.14 NaN NaN 69.85
6.4 17.64 NaN NaN 71.03
6.07 16.04 NaN NaN 68.64
5.91 17.09 NaN NaN 68.92
6 18.19 NaN NaN 68.72
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
5.58 17.17 NaN NaN 69.02
5.3 16.83 NaN NaN 67.69
5.66 19.65 NaN NaN 68.64
5.65 20.86 NaN NaN 69.45
5.43 20.46 NaN NaN 68.94
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
5.58 2 0.16 NaN 68.73];
AssetList = {
'FLWS' 'SRCE' 'FUBC' 'DDD' 'MMM' };
% Delete all-NaN rows
M(all(isnan(M),2),:) = [];
% Delete any-NaN columns
colsToBeDeleted = any(isnan(M));
M(:, colsToBeDeleted) = []
AssetList(colsToBeDeleted) = []
Result:
M =
6.4100 16.5100 69.7800
6.2200 16.0000 68.4800
6.1700 15.6100 69.4600
5.8300 15.1400 69.8500
6.4000 17.6400 71.0300
6.0700 16.0400 68.6400
5.9100 17.0900 68.9200
6.0000 18.1900 68.7200
5.5800 17.1700 69.0200
5.3000 16.8300 67.6900
5.6600 19.6500 68.6400
5.6500 20.8600 69.4500
5.4300 20.4600 68.9400
5.5800 2.0000 68.7300
AssetList =
'FLWS' 'SRCE' 'MMM'
Related
I am looking for algorithm (effective + vectorized) how to find histogram of gaps (NaN) width in the following manner:
signals are represented by (Nsamples x Nsig) array
gaps in signal are encoded by NaN's
width of gaps: is number of consecutive NaN's in the signal
gaps width histogram: is frequency of gaps with specific widths in signals
And the following conditions are fullfilled:
[Nsamples,Nsig ]= size(signals)
isequal(size(signals),size(gapwidthhist)) % true
isequal(sum(gapwidthhist.*(1:Nsamples)',1),sum(isnan(signals),1)) % true
Of course, compressed form of gapwidthhist (represented by two cells: "gapwidthhist_compressed_widths" and "gapwidthhist_compressed_freqs") is required too.
Example:
signals = [1.1 NaN NaN NaN -1.4 NaN 8.3 NaN NaN NaN NaN 1.5 NaN NaN; % signal No. 1
NaN 2.2 NaN 4.9 NaN 8.2 NaN NaN NaN NaN NaN 2.4 NaN NaN]' % signal No. 2
gapwidthhist = [1 1 1 1 0 0 0 0 0 0 0 0 0 0; % gap histogram for signal No. 1
3 1 0 0 1 0 0 0 0 0 0 0 0 0]' % gap histogram for signal No. 2
where integer histogram bins (gap widths) are 1:Nsamples (Nsamples=14).
Coresponding compressed gap histogram looks like:
gapwidthhist_compressed_widths = cell(1,Nsig)
gapwidthhist_compressed_widths =
1×2 cell array
{[1 2 3 4]} {[1 2 5]}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
gapwidthhist_compressed_freqs = cell(1, Nsig)
gapwidthhist_compressed_freqs =
1×2 cell array
{[1 1 1 1]} {[3 1 1]}
Typical problem dimension:
Nsamples = 1e5 - 1e6
Nsig = 1e2 - 1e3.
Thanks in advance for any help.
Added remark: My so far best solution is the following code:
signals = [1.1 NaN NaN NaN -1.4 NaN 8.3 NaN NaN NaN NaN 1.5 NaN NaN; % signal No. 1
NaN 2.2 NaN 4.9 NaN 8.2 NaN NaN NaN NaN NaN 2.4 NaN NaN; % signal No. 2
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN]' % signal No. 3
[numData, numSignals] = size(signals)
gapwidthhist = zeros(numData, numSignals);
for column = 1 : numSignals
thisSignal = signals(:, column); % Extract this column.
% Find lengths of all NAN runs
props = regionprops(isnan(thisSignal), 'Area');
allLengths = [props.Area]
edges = [1:max(allLengths), inf]
hc = histcounts(allLengths, edges)
% Load up gapwidthhist
for k2 = 1 : length(hc)
gapwidthhist(k2, column) = hc(k2);
end
end
% What it is:
gapwidthhist'
But I am looking mainly for pure Matlab code without any built-in matlab functions (like "regionprops" from Image Processing Toolbox)!!!
This is much more simple Matlab implementation but still not optimal (+ not vectorized):
signals = [1.1 NaN NaN NaN -1.4 NaN 8.3 NaN NaN NaN NaN 1.5 NaN NaN; % signal No. 1
NaN 2.2 NaN 4.9 NaN 8.2 NaN NaN NaN NaN NaN 2.4 NaN NaN; % signal No. 2
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN]'; % signal No. 3
signals
[numData, numSignals] = size(signals);
gapwidthhist = zeros(numData, numSignals);
gaps = zeros(numData+1,numSignals);
auxnan = isnan(signals);
for i = 1:numSignals
c = 0;
for j = 1:numData
if auxnan(j,i)
c = c + 1;
else
gaps(j,i) = c;
c = 0;
end
end
gaps(numData+1,i) = c;
gapwidthhist(:,i) = histcounts(gaps(:,i),1:numData+1);
end
gapwidthhist
Thanks to #breaker for help.
Any idea how to optimize (vectorize) this code to be more effective?
Here is a slightly more vectorized version that may be a bit quicker. I use Octave, so I don't know how much MATLAB's JIT compiler will optimize the inner loop in the other approach.
% Set up the data
signals = [1.1 NaN NaN NaN -1.4 NaN 8.3 NaN NaN NaN NaN 1.5 NaN NaN; % signal No. 1
NaN 2.2 NaN 4.9 NaN 8.2 NaN NaN NaN NaN NaN 2.4 NaN NaN; % signal No. 2
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN]'; % signal No. 3
signals
[numData, numSignals] = size(signals);
gapwidthhist = zeros(numData, numSignals);
gaps = zeros(numData+1,numSignals);
auxnan = ~isnan(signals); % We want non-NaN values to be 1
for i = 1:numSignals
difflist = diff(find([1; auxnan(:,i); 1])) - 1; % get the gap lengths
gapList = difflist(find(difflist)); % keep only the non-zero gaps
for c = gapList.' % need row vector to loop over elements
gapwidthhist(c,i) = gapwidthhist(c,i) + 1; % each gap length increments the histogram
end
end
gapwidthhist
Here's the program flow:
First, negate the auxnan array so that NaN is 0 and non-NaN is 1.
In the outer loop, pad each column with 1's on top and bottom to capture strings of NaN at the beginning and end of the signal.
Use find to get the indices of the 1 (non-NaN) elements.
Take the diff of the indices.
A diff of 1 means no gap and a diff greater than 1 gives the length of the gap plus 1, so subtract 1 from the diff result.
Use the results (indices) of find to get the values of the nonzero elements. These are the gap widths.
Now loop through the values and accumulate the results in the histogram. You might try replacing this inner loop with accumarray to see if that speeds things up any.
May be final solution:
signals = [1.1 NaN NaN NaN -1.4 NaN 8.3 NaN NaN NaN NaN 1.5 NaN NaN; % signal No. 1
NaN 2.2 NaN 4.9 NaN 8.2 NaN NaN NaN NaN NaN 2.4 NaN NaN; % signal No. 2
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN]'; % signal No. 3
signals
[numData, numSignals] = size(signals);
gapwidthhist = zeros(numData, numSignals);
auxnan = isnan(signals);
for i = 1:numSignals
c = 0;
for j = 1:numData
if auxnan(j,i)
c = c + 1;
else
if c > 0
gapwidthhist(c,i) = gapwidthhist(c,i) + 1;
c = 0;
end
end
end
if c > 0
gapwidthhist(c,i) = gapwidthhist(c,i) + 1;
end
end
gapwidthhist
Open question: how to modify the code where outer for-loop should be able to use parfor-loop?
I am trying to create a table that is 10 x 5 with only NaNs. I start by creating an array with NaNs:
N = NaN(10, 5);
then I try converting it to a table:
T = table(N);
It puts all cells into one column, but I need the table to be 5 columns with one NaN in each cell. Does anyone know how to do that?
array2table
works just fine. This takes a matrix and converts it to the table structure where each column of the matrix is a column in the output table:
>> N = NaN(10, 5);
>> T = array2table(N)
T =
N1 N2 N3 N4 N5
___ ___ ___ ___ ___
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
What you want is:
t = array2table(NaN(10,5))
Bonus (so our answers are slightly different :P) You can rename the variables to anything you want with something like:
t.Properties.VariableNames = {'x1','x2','x3','x4','x5'};
I have a time stamp as follow.
Time =
243.0000
243.0069
243.0139
243.0208
243.0278
243.0347
243.0417
243.0486
243.0556
243.0625
243.0694
243.0764
243.0833
243.0903
243.0972
243.1042
243.1111
243.1181
243.1250
243.1319
243.1389
243.1458
243.1528
243.1597
243.1667
243.1736
243.1806
243.1875
243.1944
Now I have another two column vector.
ab =
243.0300 0.5814
243.0717 0.6405
243.1134 0.6000
243.1550 0.5848
243.1967 0.5869
First column is 'Time2' and second column is 'Conc'.
Time2 = ab(:,1);
Conc = ab(:,2);
Now I want to match 'Conc' based on 'Time2' with 'Time' but only filling with 'NaN'. Also 'Time2' is not exactly as 'Time'. I can use something like following
Conc_interpolated = interp1(Time2,Conc,Time)
but it does an interpolation with artificial data. I only want to match vector length by filling with 'NaN' in 'Conc' not with interpolated data.Any recommendations? Thanks
I try to guess what you want:
you have time vector A:
TimeA = ...
[243.0000;
243.0069;
...
243.1875;
243.1944];
and probably some data A:
DataA = rand(length(TimeA),1);
now you want to implement your second time vector B:
TimeB = ...
[243.0300;
243.0717;
243.1134;
243.1550;
243.1967];
and the according data:
DataB = ...
[0.5814;
0.6405;
0.6000;
0.5848;
0.5869];
finally everything should be merged together and sorted:
X = [ TimeA, DataA , NaN(size(DataA)) ;
TimeB, NaN(size(DataB)) , DataB ]
Y = sortrows(X,1);
results to:
Y =
243.0000 0.8852 NaN
243.0069 0.9133 NaN
243.0139 0.7962 NaN
243.0208 0.0987 NaN
243.0278 0.2619 NaN
243.0300 NaN 0.5814
243.0347 0.3354 NaN
243.0417 0.6797 NaN
243.0486 0.1366 NaN
243.0556 0.7212 NaN
243.0625 0.1068 NaN
243.0694 0.6538 NaN
243.0717 NaN 0.6405
243.0764 0.4942 NaN
243.0833 0.7791 NaN
243.0903 0.7150 NaN
243.0972 0.9037 NaN
243.1042 0.8909 NaN
243.1111 0.3342 NaN
243.1134 NaN 0.6000
243.1181 0.6987 NaN
243.1250 0.1978 NaN
243.1319 0.0305 NaN
243.1389 0.7441 NaN
243.1458 0.5000 NaN
243.1528 0.4799 NaN
243.1550 NaN 0.5848
243.1597 0.9047 NaN
243.1667 0.6099 NaN
243.1736 0.6177 NaN
243.1806 0.8594 NaN
243.1875 0.8055 NaN
243.1944 0.5767 NaN
243.1967 NaN 0.5869
is that right?
My understanding is a little different, it doesn't add to Time but rather assigns each Conc to the nearst Time based on it's Time2:
ind = zeros(size(ab,1),1); %//preallocate memory
for ii = 1:size(ab,1)
[~, ind(ii)] = min(abs(ab(ii,1)-Time)); %//Based on this FEX entry: http://www.mathworks.com/matlabcentral/fileexchange/30029-findnearest-algorithm/content/findNearest.m
end
Time(:,2) = NaN; %// Prefill with NaN
Time(ind, 2) = ab(:,2)
This results in:
Time =
243.00000 NaN
243.00690 NaN
243.01390 NaN
243.02080 NaN
243.02780 0.58140
243.03470 NaN
243.04170 NaN
243.04860 NaN
243.05560 NaN
243.06250 NaN
243.06940 0.64050
243.07640 NaN
243.08330 NaN
243.09030 NaN
243.09720 NaN
243.10420 NaN
243.11110 0.60000
243.11810 NaN
243.12500 NaN
243.13190 NaN
243.13890 NaN
243.14580 NaN
243.15280 0.58480
243.15970 NaN
243.16670 NaN
243.17360 NaN
243.18060 NaN
243.18750 NaN
243.19440 0.58690
for your example inputs
This is easy in two dimensions, for example:
>> A = NaN(5,4)
>> A(2:4,2:3) = [1 2; 3 4; 5 6]
>> A(2,2) = NaN
>> A(4,3) = NaN
A =
NaN NaN NaN NaN
NaN NaN 2 NaN
NaN 3 4 NaN
NaN 5 NaN NaN
NaN NaN NaN NaN
>> A(~all(isnan(A),2),~all(isnan(A),1))
ans =
NaN 2
3 4
5 NaN
Note that NaN values in rows and columns that are not all NaN are retained.
How to expand this to multiple dimensions? For example if A has three dimensions:
>> A = NaN(5,4,3)
>> A(2:4,2:3,2) = [1 2; 3 4; 5 6]
>> A(2,2,2) = NaN
>> A(4,3,2) = NaN
A(:,:,1) =
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
A(:,:,2) =
NaN NaN NaN NaN
NaN NaN 2 NaN
NaN 3 4 NaN
NaN 5 NaN NaN
NaN NaN NaN NaN
A(:,:,3) =
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
How do I then get
ans =
NaN 2
3 4
5 NaN
I'd like to do this in four dimensions, and with much larger matrixes than the example matrix A here.
My solution to the problem based on the input A as posted by OP:
>> [i,j,k] = ind2sub(size(A),find(~isnan(A)));
>> l = min([i j k]);
>> u = max([i j k]);
>> B=A(l(1):u(1),l(2):u(2),l(3):u(3))
B =
NaN 2
3 4
5 NaN
>> size(B)
ans =
3 2
Since you stated that you want to do this on much larger matrices I'm not sure about the performance of #ronalchn's solution - that is all the all-calls. But I have no idea to what extend that matters - maybe someone can comment...
Try this:
2 dimensions
A(~all(isnan(A),2),~all(isnan(A),1))
3 dimensions
A(~all(all(isnan(A),2),3),...
~all(all(isnan(A),1),3),...
~all(all(isnan(A),1),2))
4 dimensions
A(~all(all(all(isnan(A),2),3),4),...
~all(all(all(isnan(A),1),3),4),...
~all(all(all(isnan(A),1),2),4),...
~all(all(all(isnan(A),1),2),3))
Basically, the rule is for N dimensions:
on all N dimensions you do the isnan() thing.
Then wrap it in with the all() function N-1 times,
and the 2nd argument each of the all() functions for the ith dimension should be numbers 1 to N in any order, but excluding i.
Since Theodros Zelleke wants to see whose method is faster (nice way of saying he thinks his method is so fast), here's a benchmark. Matrix A defined as:
A = NaN*ones(100,400,3,3);
A(2:4,2:3,2,2) = [1 2; 3 4; 5 6];
A(2,2,2,2) = NaN;A(4,3,2,2) = NaN;
A(5:80,4:200,2,2)=ones(76,197);
His test defined as:
tic;
for i=1:100
[i,j,k,z] = ind2sub(size(A),find(~isnan(A)));
l = min([i j k z]);
u = max([i j k z]);
B=A(l(1):u(1),l(2):u(2),l(3):u(3),l(4):u(4));
end
toc
With results:
Elapsed time is 0.533932 seconds.
Elapsed time is 0.519216 seconds.
Elapsed time is 0.575037 seconds.
Elapsed time is 0.525000 seconds.
My test defined as:
tic;
for i=1:100
isnanA=isnan(A);
ai34=all(all(isnanA,3),4);
ai12=all(all(isnanA,1),2);
B=A(~all(ai34,2),~all(ai34,1),~all(ai12,4),~all(ai12,3));
end
toc
With results:
Elapsed time is 0.224869 seconds.
Elapsed time is 0.225132 seconds.
Elapsed time is 0.246762 seconds.
Elapsed time is 0.236989 seconds.
I have lots of vectors like this one below, very sparse, lots of 'NaN'. What I intend to do is to extract the valid number out of this vector, and put them into a separate vector with no 'NaN' values.
And every vector has different positions with valid number, so I can't put them into a matrix then extract rows.
Thus please help me with this!
10459865
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
8751943
NaN
NaN
NaN
NaN
NaN
NaN
6951680
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
5991217
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
5327653
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
4740048
NaN
NaN
4265221
NaN
NaN
3973280
Assuming that vector is stored in variable a,
a(isfinite(a))
will extract just the valid (finite) entries.
You can use the isnan() function to find out if an entry is a number. Then something like
x = vector of values;
new_x = x(~isnan(x));
new_x is a vector with only the valid numbers.