I'm having trouble figuring out how to count the occurrences of a character in a string within a cell. For example, I have a file that contains information like so:
type
m
mmNs
SmNm
and I'm trying to determine how many m's are in each line. To do this, I've tried this code:
sampleddata = dataset('file','sample.txt','Delimiter','\t');
muts = sampleddata.type;
fileID = fopen('number_occur.txt','w');
for j = 1:3
mutations = muts(j)
M = length(find(mutations == 'm'));
fprintf(fileID, '%1f\n',M)
end
fclose(fileID)
However, I get an error that informs me: "Undefined operator '==' for input arguments of type 'cell'." Does anyone know how to overcome this problem?
Gonna post a result here in case you did not find a way to do it. There are loads of ways to do it, I am just going to put one of them.
Basically, you want a regex to do string matches:
a = {'type';
'm';
'mmNs';
'SmNm';
'mmmmM'} %//Load in Data,
pattern = 'm'; %//The pattern you are looking for is 'm', it could be anything really, a number of specific word or a specific pattern
lines = regexp(a, pattern, 'tokens'); %// look for this pattern in each line
result = cellfun('length',lines); %//count the size of matched patterns, so each time it matches, the size should increase by 1.
This gives the result in a matrix form:
result =
0
1
2
2
4
Related
I have a string containing several elements, some identical and some unique. I want my code to check every 2 following elements in my string and if they're equal, it should call a function ShuffleString, where the input variable (randomize) is the string itself, that will re-shuffle the string in a new position. Then, the script should re-check every 2 following elements in the string again until no two identical elements appear next to each other.
I have done the following:
My function file ShuffleString works fine. The input variable randomize, as stated earlier, contains the same elements as MyString but in a different order, as this was needed on an unrelated matter earlier in the script.
function [MyString] = ShuffleString(randomize)
MyString = [];
while length(randomize) > 0
S = randi(length(randomize), 1);
MyString = [MyString, randomize(S)];
randomize(S) = [];
end
The script doesn't work as intended. Right now it looks like this:
MyString = ["Cat" "Dog" "Mouse" "Mouse" "Dog" "Hamster" "Zebra" "Obama"...
"Dog" "Fish" "Salmon" "Turkey"];
randomize = MyString;
while(1)
for Z = 1:length(MyString)
if Z < length(MyString)
Q = Z+1;
end
if isequal(MyString{Z},MyString{Q})
[MyString]=ShuffleString(randomize)
continue;
end
end
end
It just seems to reshuffle the string an infinite amount of times. What's wrong with this and how can I make it work?
You are using an infinite while loop that has no way to break and hence it keeps iterating.
Here is a simpler way:
Use the third output argument of the unique function to get the elements in numeric form for easier processing. Apply diff on it to check if consecutive elements are same. If there is any occurrence of same consecutive elements, the output of diff will give at least one zero which when applied with negated all will return true to continue the loop and vice versa. At the end, use the shuffled indices/numeric representation of the strings obtained after the loop to index the first output argument of unique (which was calculated earlier). So the script will be:
MyString = ["Cat" "Dog" "Mouse" "Mouse" "Dog" "Hamster" "Zebra" "Obama"...
"Dog" "Fish" "Salmon" "Turkey"]; %Given string array
[a,~,c] = unique(MyString);%finding unique elements and their indices
while ~all(diff(c)) %looping until there are no same strings together
c = ShuffleString(c); %shuffling the unique indices
end
MyString = a(c); %using the shuffled indices to get the required string array
For the function ShuffleString, a better way would be to use randperm. Your version of function works but it keeps changing the size of the arrays MyString and randomize and hence adversely affects the performance and memory usage. Here is a simpler way:
function MyString = ShuffleString(MyString)
MyString = MyString(randperm(numel(MyString)));
end
I have an excel file and I need to read it based on string values in the 4th column. I have written the following but it does not work properly:
[num,txt,raw] = xlsread('Coordinates','Centerville');
zn={};
ctr=0;
for i = 3:size(raw,1)
tf = strcmp(char(raw{i,4}),char(raw{i-1,4}));
if tf == 0
ctr = ctr+1;
end
zn{ctr}=raw{i,4};
end
data=zeros(1,10); % 10 corresponds to the number of columns I want to read (herein, columns 'J' to 'S')
ctr=0;
for j = 1:length(zn)
for i=3:size(raw,1)
tf=strcmp(char(raw{i,4}),char(zn{j}));
if tf==1
ctr=ctr+1;
data(ctr,:,j)=num(i-2,10:19);
end
end
end
It gives me a "15129x10x22 double" thing and when I try to open it I get the message "Cannot display summaries of variables with more than 524288 elements". It might be obvious but what I am trying to get as the output is 'N = length(zn)' number of matrices which represent the data for different strings in the 4th column (so I probably need a struct; I just don't know how to make it work). Any ideas on how I could fix this? Thanks!
Did not test it, but this should help you get going:
EDIT: corrected wrong indexing into raw vector. Also, depending on the format you might want to restrict also the rows of the raw matrix. From your question, I assume something like selector = raw(3:end,4); and data = raw(3:end,10:19); should be correct.
[~,~,raw] = xlsread('Coordinates','Centerville');
selector = raw(:,4);
data = raw(:,10:19);
[selector,~,grpidx] = unique(selector);
nGrp = numel(selector);
out = cell(nGrp,1);
for i=1:nGrp
idx = grpidx==i;
out{i} = cell2mat(data(idx,:));
end
out is the output variable. The key here is the variable grpidx that is an output of the unique function and allows you to trace back the unique values to their position in the original vector. Note that unique as I used it may change the order of the string values. If that is an issue for you, use the setOrderparameter of the unique function and set it to 'stable'
i have a cell like this:
x = {'3D'
'B4'
'EF'
'D8'
'E7'
'6C'
'33'
'37'}
let's assume that the cell is 1000x1. i want to find the number of occurrence of pattern = [30;30;64;63] within this cell but as the order shown. in the other word it's first check x{1,1},x{2,1},x{3,1},x{4,1}
then check x{2,1},x{3,1},x{4,1},x{5,1} and like this till the end of the cell and return the number of occurrence of it.
Here is my code but it didn't work!
while (size (pattern)< size(x))
count = 0;
for i=1:size(x)-length(pattern)+1
if size(abs(x(i:i+size(pattern)-1)-x))==0
count = count+1;
end
end
end
Your example code has a couple of issues - foremost I don't believe you are doing any comparison operations, which would be necessary to identify the occurrence of the pattern within the search data (x). Also, there is a variable type mismatch between x and pattern - one is a cell array of strings, and the other is a decimal array.
One way to approach this problem would be to restructure x and pattern as strings, and then use strfind to find occurrences of pattern. This method will only work if there is no missing data in either of the variables.
x = {'3D';'B4';'EF';'D8';'E7';'6C';'33';'37';'xE';'FD';'8y'};
pattern = {'EF','D8'};
collated_x=[x{:}];
collated_pattern = [pattern{:}];
found_locations = strfind(collated_x, collated_pattern);
% Remove 'offset' matches that start at even locations
found_locations = found_locations(mod(found_locations,2)==1);
count = length(found_locations)
Use string find function.
This is fast and simple solution:
clear
str_pattern=['B4','EF']; %same as str_pattern=['B4EF'];
x = {'3D'
'B4'
'EF'
'D8'
'EB'
'4E'
'F3'
'B4'
'EF'
'37'} ;
str_x=horzcat(x{:});
inds0=strfind(str_x,str_pattern); %including in-middle
inds1=inds0(bitand(inds0,1)==1); %exclude all in-middle results
disp(str_x);
disp(str_pattern);
disp(inds0);
disp(inds1);
I have a huge cell vector cc (size: 1xN) of the form:
cc{1} = {'indexString1', 'str_row1col1', 'str_row1col2' }
cc{2} = {'indexString2', 'str_row2col1', 'bighello', 'str_row1col3' }
cc{3} = {'indexString3','str_row3col1'}
cc{4} = {'indexString4','str_row3col1', 'helloWorld'}
I want to traverse each cell and remove specific cells that contain the word "hello", e.g c{4}{2}. Can we do that without for loops keeping the final structure of cc?
Best,
Thoth.
EDIT: From the answers and comments I have seen that the structure of the cell impose some limitations. So any other suggestion to store my data are welcome. I just want to keep together all the cells (e.g. 'str_row1col1', 'str_row1col2') that correspond to the same indexString*n* (e.g. indexString1). I made this edit in case it helps some final reshape.
Using regular expressions, you can obtain a logical array in which zeros represent occurences of the word 'hello' somewhere in the nested cell. As #LuisMendo pointed out, this would be much easier to delete the unwanted cells if they were not nested:
clc
clear
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
A = (cellfun(#isempty,regexp([cc{:}],'(\w*hello|hello\w*)','match')))
Gives the following array:
A =
1 1 1 0 1 1 1 0
For the rest I think you would need a loop since the nested cells are not all of the same size. Anyhow I hope it helps you a bit.
EDIT Here is what you can do using a for loop. In order to identify words of interest (earth and water as in your comment below), simply add them to the argument in the call to regexp. This character: | is used to make some sort of list so that Matlab checks all the expressions in the brackets.
Please refer to this page for more infos on regular expressions. There is also a possibility to look for regular expressions with case-sensitivity.
Sample code, in which I added strings containing earth and water:
cc{1} = {'str_row1col1', 'earth!superman' 'str_row1col2' 'DummyString'};
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1' 'str_row3col3' 'water_batman'};
cc{4} = {'str_row3col1' 'str_row4col2' 'helloWorld'};
cc{5} = {'str_row5_LegoMan' 'str_row5col2' 'AnotherDummyString' 'Useless String' 'BonjourWorld'};
% With a for loop, for example:
FinalCell = cell(size(cc,2),1);
for k = 1:size(cc,2)
DummyCell = cc{k}; % Use dummy cell for easier indexing
% This is where you tell Matlab what words/expressions you are looking for
A = cellfun(#isempty,regexp(cc{k},'(\w*hello|hello\w*|earth|water)','match'));
DummyCell(~A) = []; % Remove the cells containing the strings/words of interest
FinalCell{k} = DummyCell;
end
Then you're good to go. Hope that helps!
The closest thing possible I found is:
clear all
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
cc1 = [cc{:}];
cc1 = cc1(~strcmp('bighello',cc1));
This reorganize your array into a one dimensional array and it cannot match regular expression, but only whole words.
For a better job I am afraid you have to use for loops.
I am attempting to loop through the variable 'docs' which is a cell array that holds strings, i need to make a for loop that colllects the terms in a cell array and then uses command 'lower' and unique to create a dictionary.
Here is the code i've tried sp far and i just get errors
docsLength = length(docs);
for C = 1:docsLength
list = tokenize(docs, ' .,-');
Mylist = [list;C];
end
I get these errors
Error using textscan
First input must be of type double or string.
Error in tokenize (line 3)
C = textscan(str,'%s','MultipleDelimsAsOne',1,'delimiter',delimiters);
Error in tk (line 4)
list = tokenize(docs, ' .,-');
Generically, if you get an "must be of type" error, that means you are passing the wrong sort of input to a function. In this case you should look at the point in your code where this is taking place (here, in tokenize when textscan is called), and doublecheck that the input going in is what you expect it to be.
As tokenize is not a MATLAB builtin function, unless you show us that code we can't say what those inputs should be. However, as akfaz mentioned in comments, it is likely that you want to pass docs{C} (a string) to tokenize instead of docs (a cell array). Otherwise, there's no point in having a loop as it just repeatedly passes the same input, docs, into the function.
There are additional problems with the loop:
Mylist = [list; C]; will be overwritten each loop to consist of the latest version of list plus C, which is just a number (the index of the loop). Depending on what the output of tokenize looks like, Mylist = [Mylist; list] may work but you should initialise Mylist first.
Mylist = [];
for C = 1:length(docs)
list = tokenize(docs{C}, ' .,-');
Mylist = [Mylist; list];
end