Counting frequency in a cell of char in Matlab: fast code?


I have a 1x2 cell A in Matlab. A{i} is a cell of dimension 30494866x1 for i=1,2. A{i}(j) is a 1x21 char for i=1,2 and j=1,...,30494866.
For example I report here A{2}(1:3)
'116374117927631468606'
'112188647432305746617'
'116374117927631468606'
I want to count how many times each 1x21 char in A{2} is repeated. For example, just considering A{2}(1:3), I want to get
'116374117927631468606' 2
'112188647432305746617' 1
What I am doing at the moment is
a=unique(A{2},'stable');
b=cellfun(@(x) sum(ismember(A{2},x)),a);
However this is incredibly slow (running since yesterday). Do you have any suggestion on how I can speed up the code?

Since you want to know how many times each 21-char string is used:
1) sort the cell
2) count how many times each string is used in a for loop.
Your code is O(n^2) so it's very slow. This should take less than a minute.
Based on your code
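B=sort(A{2});
U=unique(B); % unique already returns its output sorted
C=zeros(numel(U),1);
cnt = 1;
for j=1:numel(B)
    if strcmp(U(cnt),B(j))==1
        C(cnt)=C(cnt)+1;
    else
        % B is sorted, so a mismatch means the next unique string starts here
        cnt = cnt +1;
        if cnt <= numel(U)
            C(cnt) = C(cnt)+1;
        end
    end
end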
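The same sort-then-count idea can also be written without an explicit loop. This is only a sketch on the three example strings, not something timed on the full 30-million-element cell; the names isNew, starts and counts are made up for illustration.

```matlab
% Toy data from the question, sorted so that equal strings are adjacent.
B = sort({'116374117927631468606'
          '112188647432305746617'
          '116374117927631468606'});

% isNew(k) is true where B{k} differs from B{k-1} (the first element always starts a run).
isNew = [true; ~strcmp(B(1:end-1), B(2:end))];

starts = find(isNew);                       % index where each run of equal strings begins
counts = diff([starts; numel(B) + 1]);      % run lengths = number of repetitions
U = B(starts);                              % the unique strings, in sorted order
result = [U, num2cell(counts)];
```

Here the run lengths of the sorted cell are the frequencies, so the cost is dominated by the sort.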

You can do this with the standard unique-accumarray pair:
data = {'116374117927631468606'
'112188647432305746617'
'116374117927631468606'};
[uu, ~, ww] = unique(data, 'stable');
count = accumarray(ww, 1);
result = [uu, num2cell(count)];
Or, a little more memory-efficient:
data = {'116374117927631468606'
'112188647432305746617'
'116374117927631468606'};
[~, vv, ww] = unique(data, 'stable');
count = accumarray(ww, 1);
result = [data(vv) num2cell(count)];
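To make the role of the third output of unique more concrete, here is a small sketch on the same three strings. The final sort by frequency is an optional extra that the question did not ask for, and the variable order is a name introduced here.

```matlab
data = {'116374117927631468606'
        '112188647432305746617'
        '116374117927631468606'};

[uu, ~, ww] = unique(data, 'stable');
% ww maps every element of data to its position in uu; for this data ww is [1; 2; 1].

count = accumarray(ww(:), 1);   % adds 1 to count(ww(k)) for every k, giving [2; 1]

% Optional: list the most frequent strings first.
[count, order] = sort(count, 'descend');
result = [uu(order), num2cell(count)];
```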

Related

Not sure what to do about error message "Conversion to double from cell is not possible."

I'm writing a program that finds the indices of a matrix G where there is only a single 1 for either a column index or a row index and removes any found index if it has a 1 for both the column and row index. Then I want to take these indices and use them as indices in an array U, which is where the trouble comes. The indices do not seem to be stored as integers and I'm not sure what they are being stored as or why. I'm quite new to Matlab (but that's probably obvious) and so I don't really understand how types work in Matlab or how they're assigned. So I'm not sure why I'm getting the error message mentioned in the title and I'm not sure what to do about it. Any assistance you can provide would be greatly appreciated.
I forgot to mention this before, but G is a matrix that only contains 1s or 0s and U is an array of strings (I think it would be called a cell?)
function A = ISClinks(U, G)
B = [];
[rownum,colnum] = size(G);
j = 1;
for i=1:colnum
s = sum(G(:,i));
if s == 1
B(j,:) = i;
j = j + 1;
end
end
for i=1:rownum
s = sum(G(i,:));
if s == 1
if ismember(i, B)
B(B == i) = [];
else
B(j,:) = i;
j = j+1;
end
end
end
A = [];
for i=1:size(B,1)
s = B(i,:);
A(i,:) = U(s,:);
end
end
This is the problem code, but I'm not sure what's wrong with it.
A = [];
for i=1:size(B,1)
s = B(i,:);
A(i,:) = U(s,:);
end

Your program seems to be structured as though it had been written in a language like C. In MATLAB, you can often replace low-level loops with specialized functions (e.g. any()). Your function could be written more efficiently as:
function A = ISClinks(U, G)
% Find columns and rows that are set in the input
active_columns=any(G,1);
active_rows=any(G,2).';
% (Optional) Prevent columns and rows with same index from being simultaneously set
%exclusive_active_columns = active_columns & ~active_rows; %not needed; this line is only for illustrative purposes
%exclusive_active_rows = active_rows & ~active_columns; %same as above
% Merge column state vector and row state vector by XORing them
active_indices=xor(active_columns,active_rows);
% Select appropriate rows of matrix U
A=U(active_indices,:);
end
This function does not cause errors with the example input matrices I tested. If U is a cell array (e.g. U={'Lorem','ipsum'; 'dolor','sit'; 'amet','consectetur'}), then return value A will also be a cell array.
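On the error message itself: in the question's loop, A = [] creates a double array, and assigning the cell U(s,:) into it is what raises "Conversion to double from cell is not possible." Indexing the cell array once with a logical mask avoids that. Note also that the question tests for exactly one 1 (sum == 1), whereas any() tests for at least one. A minimal sketch under the question's reading, assuming G is square so that row and column indices are comparable; the names colHit, rowHit and idx are made up here:

```matlab
% Toy inputs (assumed): a square 0/1 matrix and a cell array with one row per index.
G = [1 0 0
     0 0 0
     0 1 1];
U = {'a'; 'b'; 'c'};

colHit = sum(G, 1) == 1;        % columns that contain exactly one 1
rowHit = (sum(G, 2) == 1).';    % rows that contain exactly one 1, as a row vector

% Keep an index if it qualifies as a column or as a row, but not as both.
idx = xor(colHit, rowHit);

A = U(idx, :);                  % parenthesis indexing keeps A a cell array
```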

For parfor, vertical concatenation of table is too slow

I am suffering from speed issue on vertical concatenation of tables, each of which is a part of parfor loop.
Here's what I do:
T = table(group1, group2, val1, val2, ...);
T = sortrows(T, {'group1', 'group2'}); % possible speed gains from sorting?
[Pairs, idx] = unique([T.group1, T.group2], 'rows');
idx1 = [idx; height(T)+1]; % to include the last index
idx2 = idx1(2:end) - 1;
idx1 = idx1(1:end-1);
T_ = [];
parfor i=1:length(Pairs)
tmpIdx = idx1(i):idx2(i);
T_ = T(tmpIdx, :);
T__ = myFun(T_);
T_ = [T_; T__];
end
function T_out = myFun(T_in)
vlist = T_in.Properties.VariableNames;
T_out = T_in(:, vlist) % preallocation
% overwrite values
for i = 1 : length(vlist)
T_out.vlist(2:end, i) = T_in{1:end-1, vlist{i}};
end
end
Inside myFun(), I do some calculations on T.val1, T.val2.... However, above is too slow. I checked the profiler and found out that most of the time is spent in
1) T_=T(tmpIdx, :) (tabular.vertcat)
2) T_out.vlist(2:end, i) = T_.vlist{1:end-1, vlist{i}} inside myFun (tabular.subsasgn, tabular.subsref)
For 1), despite its slow speed, I guess slicing T is inevitable as I'm calculating values w.r.t. unique pairs of T.group1 and T.group2.
Though the main problem is 1), the time spent on 2) is also significant.
Any good ideas for improvements? Any comments or suggestions would be very helpful.
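One pattern that may be worth trying, offered only as an untested-for-speed sketch: store each group's result in its own cell slot and concatenate once after the loop, instead of growing a table inside the loop. The toy table, the names parts, chunk, lagged and T_all, and the one-row shift standing in for myFun are all assumptions. A plain for loop is used so that the sketch runs without the Parallel Computing Toolbox; since parts{k} is indexed only by the loop variable, the same body should be usable inside parfor.

```matlab
% Toy table standing in for T (assumed data).
T = table([1;1;2;2;2], [1;1;1;1;1], (10:10:50).', ...
    'VariableNames', {'group1', 'group2', 'val1'});
T = sortrows(T, {'group1', 'group2'});

% Group boundaries: T is sorted, so each group is one contiguous block of rows.
[~, ~, g] = unique([T.group1, T.group2], 'rows');
idx1 = find([true; diff(g) ~= 0]);          % first row of each group
idx2 = [idx1(2:end) - 1; height(T)];        % last row of each group

parts = cell(numel(idx1), 1);               % one slot per group
for k = 1:numel(idx1)
    chunk = T(idx1(k):idx2(k), :);
    lagged = chunk;
    % Stand-in for myFun: shift val1 down by one row within the group.
    lagged.val1(2:end) = chunk.val1(1:end-1);
    parts{k} = [chunk; lagged];
end

T_all = vertcat(parts{:});                  % single concatenation at the end
```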

Get final variable for a code with loops on Matlab

I have code with two for loops. The code is working properly. The problem is that at the end I would like to get a variable megafinal with the results for all the years. The original variable A has 3M rows, so it gives me an error because the size of megafinal changes with each loop iteration and Matlab stops running the code. I guess it's a problem of inefficiency. Does anyone know a way to get this final variable despite the varying size?
y = 1997:2013;
for i=1:length(y)
A=b(cell2mat(b(:,1))==y(i),:);
%Obtain the absolute value of the difference
c= cellfun(@minus,A(:,3),A(:,4));
c=abs(c);
c= num2cell(c);
A(:,end+1) = c;
%Delete rows based on a condition
d = (abs(cell2mat(A(:,8)) - cell2mat(A(:,7))));
[~, ind1] = sort(d);
e= A(ind1(end:-1:1),:);
[~, ind2,~] = unique(strcat(e(:,2),e(:, 6)));
X= e(ind2,:);
(…)
for j = 2:length(X)
if strcmp(X(j,2),X(j-1,2)) == 0
lin2 = j-1;
%Sort
X(lin1:lin2,:) = sortrows(X(lin1:lin2,:),13);
%Rank
[~,~,f]=unique([X{lin1:lin2,13}].');
g=accumarray(f,(1:numel(f))',[],@mean);
X(lin1:lin2,14)=num2cell(g(f));
%Score
out1 = 100 - ((cell2mat(X(lin1:lin2,14))-1) ./ size(X(lin1:lin2,:),1))*100;
X(lin1:lin2,15) = num2cell(out1);
lin1 = j;
end
end
%megafinal(i)=X
end

Make megafinal a cell array. This will account for the varying sizes of X at each iteration. As such, simply do this:
megafinal{i} = X;
To access a cell element, you just have to do megafinal{num}, where num is any index you want.
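If one big cell array with all years is needed afterwards, the per-year pieces can be stacked once the loop has finished, provided they all have the same number of columns. A small sketch with placeholder data; the repmat line only stands in for the real per-year X, and allYears is a name made up here:

```matlab
y = 1997:1999;
megafinal = cell(numel(y), 1);       % preallocate one slot per year

for i = 1:numel(y)
    % Placeholder for the real per-year result: i rows, 2 columns.
    X = repmat({y(i)}, i, 2);
    megafinal{i} = X;                % each slot may hold a different number of rows
end

% Stack all years vertically (requires the same number of columns in every X).
allYears = vertcat(megafinal{:});
```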

MATLAB: Creating a matrix from for loop values?

I have the following code:
for i = 1450:9740:89910
n = i+495;
range = ['B',num2str(i),':','H',num2str(n)];
iter = xlsread('BrokenDisplacements.xlsx' , range);
displ = iter;
displ = [displ; iter];
end
This takes values from an Excel file for a number of ranges I want and outputs them as matrices. However, this code just uses the final value of displ and creates the total matrix from there. I would like to collect these outputs (displ) into one large matrix, saving values along the way. How would I go about doing this?

Since you know the size of the block of data you are reading, you can make your code much more efficient as follows:
firstVals = 1450:9740:89910;
% Rows needed: from firstVals(1) up to firstVals(end)+495, inclusive
displ = zeros(firstVals(end) - firstVals(1) + 496, 7);
for ii = firstVals
    n = ii + 495;
    range = sprintf('B%d:H%d', ii, n);
    displ((ii:n)-firstVals(1)+1,:) = xlsread('BrokenDisplacements.xlsx', range);
end
Couple of points:
I prefer not to use i as a variable since it is built in as sqrt(-1) - if you later execute code that assumes that to be true, you're in trouble
I am not assuming that the last value of ii is 89910 - by first assigning the value to a vector, then finding the last value in the vector, I sidestep that question
I assign all space in displ at once - otherwise, as it grows, Matlab keeps having to move the array around which can slow things down a lot
I used sprintf to generate the string representing the range - I think it's more readable but it's a question of style
I assign the return value of xlsread directly to a block in displ that is the right size
I hope this helps.
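Note that indexing displ by spreadsheet row leaves zero rows between the blocks, because each block is 496 rows long while the blocks start 9740 rows apart. If the blocks should sit directly under one another instead, a block counter can be used. A sketch only: the constant matrix stands in for the xlsread call so the snippet runs without the Excel file, and blockRows, nCols and rows are names introduced here.

```matlab
firstVals = 1450:9740:89910;
blockRows = 496;                     % rows per block (ii .. ii+495)
nCols = 7;                           % columns B..H

displ = zeros(blockRows * numel(firstVals), nCols);   % preallocate the packed result

for k = 1:numel(firstVals)
    ii = firstVals(k);
    range = sprintf('B%d:H%d', ii, ii + blockRows - 1);
    % Stand-in for: block = xlsread('BrokenDisplacements.xlsx', range);
    block = k * ones(blockRows, nCols);
    rows = (k - 1) * blockRows + (1:blockRows);        % destination rows of block k
    displ(rows, :) = block;
end
```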

How about this:
displ=[];
for i = 1450:9740:89910
n = i+495;
range = ['B',num2str(i),':','H',num2str(n)];
iter = xlsread('BrokenDisplacements.xlsx' , range);
displ = [displ; iter];
end

How to improve execution time of the following Matlab code

Please help me improve the execution time of the following Matlab code.
I want to make a random matrix (size [8,12,10]) in which every row contains only integer values between 1 and 12. In every column, the number of elements equal to each of the values 1, 2, 3 and 4 should be 2 (the code below also accepts 0).
The following code will make things more clear, but it is very slow.
Can anyone give me a suggestion??
clc
clear all
jum_kel=8
jum_bag=12
uk_pop=10
for ii=1:uk_pop;
for a=1:jum_kel
krom(a,:,ii)=randperm(jum_bag); %batasan tidak boleh satu kelompok melakukan lebih dari satu aktivitas dalam satu waktu (constraint: a group may not do more than one activity at a time)
end
end
for ii=1:uk_pop;
gab1(:,:,ii) = sum(krom(:,:,ii)==1)
gab2(:,:,ii) = sum(krom(:,:,ii)==2)
gab3(:,:,ii) = sum(krom(:,:,ii)==3)
gab4(:,:,ii) = sum(krom(:,:,ii)==4)
end
for jj=1:uk_pop;
gabh1(:,:,jj)=numel(find(gab1(:,:,jj)~=2& gab1(:,:,jj)~=0))
gabh2(:,:,jj)=numel(find(gab2(:,:,jj)~=2& gab2(:,:,jj)~=0))
gabh3(:,:,jj)=numel(find(gab3(:,:,jj)~=2& gab3(:,:,jj)~=0))
gabh4(:,:,jj)=numel(find(gab4(:,:,jj)~=2& gab4(:,:,jj)~=0))
end
for ii=1:uk_pop;
tot(:,:,ii)=gabh1(:,:,ii)+gabh2(:,:,ii)+gabh3(:,:,ii)+gabh4(:,:,ii)
end
for ii=1:uk_pop;
while tot(:,:,ii)~=0;
for a=1:jum_kel
krom(a,:,ii)=randperm(jum_bag); %batasan tidak boleh satu kelompok melakukan lebih dari satu aktivitas dalam satu waktu (constraint: a group may not do more than one activity at a time)
end
gabb1 = sum(krom(:,:,ii)==1)
gabb2 = sum(krom(:,:,ii)==2)
gabb3 = sum(krom(:,:,ii)==3)
gabb4 = sum(krom(:,:,ii)==4)
gabbh1=numel(find(gabb1~=2& gabb1~=0));
gabbh2=numel(find(gabb2~=2& gabb2~=0));
gabbh3=numel(find(gabb3~=2& gabb3~=0));
gabbh4=numel(find(gabb4~=2& gabb4~=0));
tot(:,:,ii)=gabbh1+gabbh2+gabbh3+gabbh4;
end
end

Some general suggestions:
Name variables in English. Give a short explanation if it is not immediately clear
what they are intended for. What is jum_bag, for example? For me uk_pop is a music style.
Write comments in English, even if you develop source code only for yourself.
If you ever have to share your code with a foreigner, you will spend a lot of time
explaining or re-translating. I would like to know, for example, what
%batasan tidak boleh means. Probably you describe here that this is only a quick
hack but that someone should really check this again before going into production.
Specific to your code:
It's really easy to confuse gab1 with gabh1 or gabb1.
For me, krom is too similar to the built-in function kron. In fact, I first
thought that you are computing lots of tensor products.
gab1 .. gab4 are probably best combined into an array or into a cell, e.g. you
could use
gab = cell(1, 4);
for ii = ...
gab{1}(:,:,ii) = sum(krom(:,:,ii)==1);
gab{2}(:,:,ii) = sum(krom(:,:,ii)==2);
gab{3}(:,:,ii) = sum(krom(:,:,ii)==3);
gab{4}(:,:,ii) = sum(krom(:,:,ii)==4);
end
The advantage is that you can rewrite the comparisons with another loop.
It also helps when computing gabh1, gabb1 and tot later on.
If you further introduce a variable like highestNumberToCompare, you only have to
make one change when you suddenly find out that it's important to check whether the
elements are equal to 5 and 6, too.
Add a semicolon at the end of every command. Having too much output is annoying and
also slow.
The numel(find(gabb1 ~= 2 & gabb1 ~= 0)) is better expressed as
sum(gabb1(:) ~= 2 & gabb1(:) ~= 0). A find is not needed because you do not care
about the indices but only about the number of indices, which is equal to the number
of true values.
And of course: This code
for ii=1:uk_pop
gab1(:,:,ii) = sum(krom(:,:,ii)==1)
end
is really, really slow. In every iteration, you increase the size of the gab1
array, which means that you have to i) allocate more memory, ii) copy the old array
and iii) write the new page. This is much faster if you set the size of the
gab1 array before the loop:
gab1 = zeros(... final size ...);
for ii=1:uk_pop
gab1(:,:,ii) = sum(krom(:,:,ii)==1);
end
You should probably also rethink the size and shape of gab1. I don't think you
need a 3D array here, because sum() collapses the row dimension: each page
krom(:,:,ii) yields a single row vector, so the results would also fit in a 2D array.
You can probably skip the loop altogether and use sum(krom==1, 1) instead, which sums
over the rows of every page in one call and returns a 1-by-12-by-10 array.
In any case, you should be really aware of the size and shape of your
results.
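To illustrate the last few points together (preallocation, one loop over the numbers instead of gab1 .. gab4, and summing along a chosen dimension), here is a minimal sketch. It assumes the same sizes as the question; the short names A, B, C and maxNumber are introduced here, and tot is meant to hold the same per-page violation count as in the question.

```matlab
A = 8; B = 12; C = 10;      % rows, columns, pages (jum_kel, jum_bag, uk_pop)
maxNumber = 4;              % numbers 1..maxNumber are checked

krom = zeros(A, B, C);      % preallocate instead of growing inside the loop
for ii = 1:C
    for a = 1:A
        krom(a, :, ii) = randperm(B);
    end
end

% tot(1,1,ii) = number of (column, number) pairs on page ii whose count
% is neither 0 nor 2.
tot = zeros(1, 1, C);
for number = 1:maxNumber
    gab = sum(krom == number, 1);                 % 1 x B x C: count per column and page
    tot = tot + sum(gab ~= 2 & gab ~= 0, 2);      % 1 x 1 x C: violations per page
end
```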
Edit inspired by Rody Oldenhuis:
As Rody pointed out, the 'problem' with your code is that it's highly unlikely (though
not impossible) that you create a matrix which fulfills your constraints by assigning
the numbers randomly. The code below creates a matrix temp with the following characteristics:
The numbers 1 .. maxNumber appear either twice per column or not at all.
Every row is a random permutation of the numbers 1 .. B, where B is equal to
the length of a row (i.e. the number of columns).
Finally, the temp matrix is used to fill a 3D array called result. I hope you can adapt it to your needs.
clear all;
A = 8; B = 12; C = 10;
% The numbers [1 .. maxNumber] have to appear exactly twice in a
% column or not at all.
maxNumber = 4;
result = zeros(A, B, C);
for ii = 1 : C
    temp = zeros(A, B);
    for number = 1 : maxNumber
        forbiddenRows = zeros(1, A);
        forbiddenColumns = zeros(1, A/2);
        for count = 1 : A/2
            illegalIndices = true;
            while illegalIndices
                illegalIndices = false;
                % Draw a column which has not been used for this number.
                randomColumn = randi(B);
                while any(ismember(forbiddenColumns, randomColumn))
                    randomColumn = randi(B);
                end
                % Draw two rows which have not been used for this number.
                randomRows = randi(A, 1, 2);
                while randomRows(1) == randomRows(2) ...
                        || any(ismember(forbiddenRows, randomRows))
                    randomRows = randi(A, 1, 2);
                end
                % Make sure not to overwrite previous non-zeros.
                if any(temp(randomRows, randomColumn))
                    illegalIndices = true;
                    continue;
                end
            end
            % Mark the rows and column as forbidden for this number.
            forbiddenColumns(count) = randomColumn;
            forbiddenRows((count - 1) * 2 + (1:2)) = randomRows;
            temp(randomRows, randomColumn) = number;
        end
    end
    % Now every row contains the numbers [1 .. maxNumber] by
    % construction. Fill the zeros with a permutation of the
    % interval [maxNumber + 1 .. B].
    for count = 1 : A
        mask = temp(count, :) == 0;
        temp(count, mask) = maxNumber + randperm(B - maxNumber);
    end
    % Store this page.
    result(:,:,ii) = temp;
end
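The two characteristics listed above can be checked after the fact. The following is a sketch of such a check, run here on a tiny hand-made array rather than on the 8 x 12 x 10 result, so that the expected outcome is obvious; rowsArePermutations and countsOk are names introduced for this illustration.

```matlab
% Toy 2 x 3 x 2 array: every row is a permutation of 1..3, and the
% number 1 appears exactly twice in one column of each page.
result = cat(3, [1 2 3; 1 3 2], [2 1 3; 3 1 2]);
maxNumber = 1;

[A, B, C] = size(result);

% Every row, once sorted, must equal 1..B.
rowsArePermutations = isequal(sort(result, 2), repmat(1:B, [A, 1, C]));

% Each number 1..maxNumber must occur 0 or 2 times in every column of every page.
countsOk = true;
for number = 1:maxNumber
    cnt = sum(result == number, 1);          % 1 x B x C counts per column
    countsOk = countsOk && all(cnt(:) == 0 | cnt(:) == 2);
end
```

Replacing the toy array with the generated result (and maxNumber = 4) runs the same check on the real data.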

OK, the code below will improve the timing significantly. It's not perfect yet, it can all be optimized a lot further.
But, before I do so: I think what you want is fundamentally impossible.
So you want
all rows contain the numbers 1 through 12, in a random permutation
any value between 1 and 4 must be present either twice or not at all in any column
I have a hunch this is impossible (that's why your code never completes), but let me think about this a bit more.
Anyway, my 5-minute-and-obvious-improvements-only-version:
clc
clear all
jum_kel = 8;
jum_bag = 12;
uk_pop = 10;
A = jum_kel; % renamed to make language independent
B = jum_bag; % and a lot shorter for readability
C = uk_pop;
krom = zeros(A, B, C);
for ii = 1:C;
for a = 1:A
krom(a,:,ii) = randperm(B);
end
end
gab1 = sum(krom == 1);
gab2 = sum(krom == 2);
gab3 = sum(krom == 3);
gab4 = sum(krom == 4);
gabh1 = sum( gab1 ~= 2 & gab1 ~= 0 );
gabh2 = sum( gab2 ~= 2 & gab2 ~= 0 );
gabh3 = sum( gab3 ~= 2 & gab3 ~= 0 );
gabh4 = sum( gab4 ~= 2 & gab4 ~= 0 );
tot = gabh1+gabh2+gabh3+gabh4;
for ii = 1:C
    ii % display progress
    while tot(:,:,ii) ~= 0
        for a = 1:A
            krom(a,:,ii) = randperm(B);
        end
        gabb1 = sum(krom(:,:,ii) == 1);
        gabb2 = sum(krom(:,:,ii) == 2);
        gabb3 = sum(krom(:,:,ii) == 3);
        gabb4 = sum(krom(:,:,ii) == 4);
        gabbh1 = sum(gabb1 ~= 2 & gabb1 ~= 0);
        gabbh2 = sum(gabb2 ~= 2 & gabb2 ~= 0);
        gabbh3 = sum(gabb3 ~= 2 & gabb3 ~= 0);
        gabbh4 = sum(gabb4 ~= 2 & gabb4 ~= 0);
        tot(:,:,ii) = gabbh1+gabbh2+gabbh3+gabbh4;
    end
end