How do I determine where a word appears over multiple sentences? - matlab

I have the following code that tells me how many times a word appears in a sentence. Specifically, I have a 1D cell array of sentence strings and a 1D cell array of words that I want to search for within each sentence. The code produces a 2D array where each row and column combination tells me how many times a particular word (column) appears in a sentence (row). In other words:
Out = NaN(numel(sentences), numel(out_words));
for i = 1:numel(out_words)
Out(:,i) = cellfun(@(x) numel(strfind(x, out_words{i})), sentences);
end
display(Out)
What I would like now is a 1D cell array Out where each element describes a word and within this element is a vector that tells you which sentences the word appears in.
For example, if the word is trees and trees is assigned an ID of 1, a potential vector within the cell element of Out{1} could be [1,5,8], which means that the word trees appeared in sentences 1, 5, and 8. Is there a way to do this in MATLAB?

I will reiterate what I asked in the comments so people on the StackOverflow community will know what the question is really asking:
The OP wants a 1D cell array Out where each element describes a word and within this element is a vector that tells you which sentences the word appears in.
For example, if the word is trees and trees is assigned an ID of 1, a potential vector within the cell element of Out{1} could be [1,5,8], which means that the word trees appeared in sentences 1, 5, and 8.
One easy way to do this would be to loop over every word that you have and use strfind and see which elements in the output are non-empty. For those locations that are non-empty, this would determine where the word has occurred in a particular sentence. Let's do an example. I'm going to choose six sentences to be the first six lyrics of Bruce Cockburn's Lovers in a Dangerous Time. I'll declare this to be in a cell array of strings called sentences. To be sure we can find words correctly and not worry about case, we will convert all sentences to lower case with lower:
sentences = {'Don''t the hours grow shorter as the days go by',
'You never get to stop and open your eyes'
'One day you''re waiting for the sky to fall'
'The next you''re dazzled by the beauty of it all'
'When you''re lovers in a dangerous time'
'Lovers in a dangerous time'};
sentences = lower(sentences);
Next, I'm going to declare a words array that determines which words I want to find over all of the sentences:
words = {'you', 'one', 'dangerous', 'beauty', 'go', 'to'};
As such, the code you want is very simply:
Out = cell(numel(words), 1); %// Declare empty array of cells for each word
for idx = 1 : numel(words) %// For each word...
K = strfind(sentences, words{idx}); %// See which sentences have these words
ind = cellfun('isempty', K); %// Determine which locations are EMPTY
Out{idx} = find(~ind); %// To find those locations that are non-empty, we need to find those entries that are 0, so search for the inverse
end
Let's go through the above code slowly. We first declare a cell array of elements (1D) that is as long as the total number of words we have. Next, for each word, we use strfind to determine whether we can find that particular word in all of the sentences. strfind will return a cell array where each element in this array tells you the starting index (or indices if there is more than one occurrence) of where we have found this word. If an element in this cell array is empty, this means that we did not find the word in the sentence.
Now, what we're going to do next is search within this cell array for any entries that are empty. This can be done with cellfun and the output will be a logical vector where 1 means it's empty and 0 means it's non-empty. We want the locations that are non-empty, so we invert this vector with ~ and use find to get the indices of the entries that are now 1. These indices are the sentences in which we have found that word.
As such, if we run with the above example, this is what we get:
>> celldisp(Out)
Out{1} =
2
3
4
5
Out{2} =
3
Out{3} =
5
6
Out{4} =
4
Out{5} =
1
Out{6} =
2
3
This means that for the first word, you, we have found this word in sentences #2, #3, #4, and #5. Sentence #2 is 'you never get to stop and open your eyes'; the other three contain you're, which counts because strfind matches substrings rather than whole words. Next, the second word is one, and we have found this in sentence #3, which reads: 'one day you're waiting for the sky to fall'. The next word is dangerous, and we have found this in sentences #5 and #6, which read: 'when you're lovers in a dangerous time' and 'lovers in a dangerous time'. You can follow along with the rest of the cell array and you can verify that what each cell element gives you are the sentence IDs that tell you where each word appeared.
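If substring hits such as you inside you're (or to inside stop) are not wanted, one possible variation is to split each sentence into tokens first and compare whole tokens with strcmp. This is only a sketch and not part of the approach above: the token pattern [a-z']+ and the names tokens and OutWhole are assumptions made for illustration.

```matlab
% Three of the lyric lines from above, lower-cased as before
sentences = lower({'You never get to stop and open your eyes';
                   'When you''re lovers in a dangerous time';
                   'Lovers in a dangerous time'});
words = {'you', 'dangerous', 'to'};

% Split every sentence into tokens made of letters and apostrophes
tokens = cellfun(@(s) regexp(s, '[a-z'']+', 'match'), sentences, ...
                 'UniformOutput', false);

OutWhole = cell(numel(words), 1);
for idx = 1 : numel(words)
    % A sentence counts only if one of its tokens equals the word exactly
    hit = cellfun(@(t) any(strcmp(t, words{idx})), tokens);
    OutWhole{idx} = find(hit);
end
```

With this variant, you is reported only for the first of these three sentences, because you're is a different token.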
In your comments, you want to go further and make an associative array where you specify the string you want and the output will be the sentence IDs of the sentences that contain these words. You can use a containers.Map class to do that for you. Specifically:
out_dict = containers.Map(words, Out);
Now, you can do:
>> out_dict('dangerous')
ans =
5
6
If you want to input in multiple words, use the values method:
vals = values(out_dict, {'you', 'dangerous', 'go'})
celldisp(vals)
vals{1} =
2
3
4
5
vals{2} =
5
6
vals{3} =
1
If you want to display all of the words, just do:
vals = values(out_dict, words);
celldisp(vals)
vals{1} =
2
3
4
5
vals{2} =
3
vals{3} =
5
6
vals{4} =
4
vals{5} =
1
vals{6} =
2
3
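One thing to watch with containers.Map is that indexing with a key that was never stored raises an error. A minimal sketch of guarding a lookup with isKey follows; the small map and the query word trees are made up for illustration, and returning an empty array for unknown words is just one possible convention.

```matlab
% Small stand-in for the map built above
out_dict = containers.Map({'dangerous', 'go'}, {[5; 6], 1});

query = 'trees';              % a word that was never indexed
if isKey(out_dict, query)
    ids = out_dict(query);
else
    ids = [];                 % treat unknown words as "found nowhere"
end
```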
BTW, I will reiterate this for you. Please consider accepting the answers that were provided to you in your previous questions. This signifies to the StackOverflow community that you no longer need help for your questions. Since you like reading answers, you can read this to help you figure out how to accept answers: How to update and accept answers
Good luck!

Related

Write to the textbox only if the condition is true in Matlab

I struggle with printing text in my Matlab GUI.
I have code like this in my callback:
if Lia == ismember(handles.T(1:3),(1,1,1))
set(handles.t1, 'String', 'good day');
end
The problem is, I don't know how to check whether elements 1 to 3 of my array hold the numbers 1, 1, 1. I looked through the documentation, but it appears to say nothing about that (or I simply cannot find the proper answer).
You can simply use all and check whether every element in the first three slots of your array equals 1. I don't know the shape of your array, so I'm going to force it to be a column vector with h(:). If the first three slots were a row vector and the vector of 1s were a column vector (or vice versa), you would get a rather unpleasant surprise: depending on your MATLAB release, the comparison either fails with a dimension-mismatch error or (R2016b and later, through implicit expansion) silently produces a 3-by-3 matrix instead of three element-wise comparisons:
h = handles.T(1:3);
if all(h(:) == [1; 1; 1])
set(handles.t1, 'String', 'good day');
end
Note that I could have simply done all(h(:) == 1) as a special case since we are performing a comparison of every element in an array with a single value. However, I have a feeling that this may change for you, so I've decided to explicitly make a vector of 1s so you can change the contents of what you want to compare to at a later time.
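If you would rather get a single true/false answer directly, isequal is another option: it is true only when both the sizes and the values match. A minimal sketch, where T is a made-up stand-in for handles.T and disp replaces the GUI call:

```matlab
T = [1 1 1 0 2];                 % hypothetical stand-in for handles.T
h = T(1:3);
% h(:) is still needed so that both sides are 3-by-1 column vectors
matches = isequal(h(:), [1; 1; 1]);
if matches
    disp('good day');
end
```

Without the h(:), a row vector compared against the column [1; 1; 1] would make isequal return false because the shapes differ.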

Search for an exact match in string

Given a table with the following format in MATLAB:
itemids keywords
1 3D,children,anim,pixar,3D,3D pixar
2 3D,4D pixar,3D car
... ...
I want to count the number of times each keyword is repeated in each item. The full list of unique keywords is available in keywords = {'3D';'Children';'anim';'pixar' ...}. The output is a matrix TF with as many rows as there are items and length(keywords) columns.
One of the difficulties here is to search for an exact match for each string. I am currently using strcmp(), which seems to return every entry containing a given word rather than only exact matches. In my case I need to differentiate between 3D and 3D pixar.
This can be done using the ismember function in MATLAB. I am assuming that the keywords for each item are stored as a single string (called keywordsStr here), in which case you will need to split the keywords before calling ismember. Note that ismember is case-sensitive for strings, so the entries of relevantKeywords must use the same case as the items.
relevantKeywords = {'3D','Children','anim','pixar'};
keywordsInItem = strtrim(strsplit(keywordsStr,',')); % Split the words and trim each word
tmp = ismember(relevantKeywords,keywordsInItem);
tmp will be a logical array of size 1 x length(relevantKeywords) indicating whether each relevant keyword was found in the item.
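Since the question asks for counts rather than presence, one way to build a single row of TF is to count exact matches per keyword. The sketch below uses item 1 from the question as keywordsStr and compares with strcmpi so that Children also matches children; both choices, and the name tfRow, are assumptions.

```matlab
keywords = {'3D'; 'Children'; 'anim'; 'pixar'};
keywordsStr = '3D,children,anim,pixar,3D,3D pixar';   % item 1 from the question

% Split on commas and trim, so '3D pixar' stays one token
keywordsInItem = strtrim(strsplit(keywordsStr, ','));

% For each keyword, count the tokens that match it exactly (ignoring case)
tfRow = cellfun(@(k) sum(strcmpi(keywordsInItem, k)), keywords).';
```

Here 3D is counted twice and 3D pixar is not counted toward either 3D or pixar, because only whole tokens are compared.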

How to select Text sentences from original document that are split and numbered in Matlab?

I have a text document. I split it into separate sentences after each full stop and displayed them, using the following code:
sentences = regexp(F,'\S.*?[\.\!\?]','match')
char(sentences)
Then I did some processing and got a selection of sentences in the form of sentence numbers like 1,2,3,4,...n, which are stored in 1D cell arrays as follows:
output=
out{1}= 1,2
out{2}= 2, 4
out{n}= n..
These 1, 2, 4 are the sentence numbers. I want to select and display only sentences #1, 2 and 4 from sentences. Suppose I have 10 sentences; the output should then be just these 3 sentences.
There are many ways to select and display only the indexed sentences. For instance:
1- With a for loop
for i = 1:numel(out{1})
fprintf('%s\n', sentences{out{1}(i)}); % print the i-th selected sentence
end
2- In one line, with cellfun:
cellfun(@(x) fprintf('%s\n',x), sentences(out{1}));
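If the goal is to print each selected sentence once across all cells of out (sentences 1, 2 and 4 in the question's example), the indices can be merged first. A small sketch, assuming every out{k} is a row vector of indices; the sample sentences are made up.

```matlab
sentences = {'First one.', 'Second one.', 'Third one.', 'Fourth one.'};
out = {[1 2], [2 4]};            % hypothetical selections

% Concatenate all index vectors and drop duplicates (result is sorted)
selected = unique([out{:}]);

% sentences{selected} expands to one argument per sentence,
% and fprintf reuses the format for each of them
fprintf('%s\n', sentences{selected});
```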

Generate a random String in MatLab

I am trying to generate an array of strings from a long predefined array of chars. Suppose I have the following long string:
s= 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba'
I want to create a group of random strings based on the following rules:
1- the string should be between 4 and 12 chars long
2- it should end or start with one of the following chars {j,q,v,f,x,g,b,d,z}
Here is a solution that returns all strings fulfilling the following rules:
1- The starting and ending chars both have to come from the string:
start_end_char= 'jqvfxgbdz';
2- The length has to be between 4 and 8 chars.
3- The string has to be sequentially correct, meaning the resulting strings have to appear in exactly the same way in the "long" string.
So what am I doing?
First of all, I find all the positions where the predefined starting and ending chars appear in the main string (careful: I used s2 instead of s as the string name).
Then I get a sorted list of those points (sorted_list).
Next, for each element I get a list of indices of acceptable ending chars (following rules 1 and 2 stated above). These are saved in helper, which has to be a cell array because the lists differ in length.
Last but not least, I construct all those strings and save them in resulting_strings, which also has to be a cell array.
s2 = 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba';
start_end_char = 'jqvfxgbdz';
length_start = length(start_end_char);
%% finding all positions of possible starting/ending points
position_char = cell(1, length_start);
for k = 1:length_start
    position_char{k} = find(s2 == start_end_char(k));
end
list_of_start_end_points = [];
%% getting an array with all starting/ending points in the given array
for k = 1:length_start
    list_of_start_end_points = horzcat(list_of_start_end_points, position_char{k});
end
sorted_list = sort(list_of_start_end_points);
%% getting possible combinations
helper = cell(1, length(sorted_list));
length_helper = 0; % running total of valid (start, end) pairs
for k = 1:length(sorted_list)
    % s2(a:b) has b-a+1 chars, so an offset of 3 to 7 gives 4 to 8 chars
    offset = sorted_list - sorted_list(k);
    helper{k} = find(offset >= 3 & offset <= 7);
    length_helper = length_helper + length(helper{k});
end
%% building the strings
resulting_strings = cell(1, length_helper);
l = 1;
for k = 1:length(sorted_list)
    for m = 1:length(helper{k})
        resulting_strings{1,l} = s2(sorted_list(k):sorted_list(helper{k}(m)));
        l = l + 1;
    end
end
This solution uses quite a few loops. The first two loops are negligible (their iteration count is the number of acceptable start/end letters), but the latter two can be quite time-consuming if the original string is much longer. So maybe someone will find a vectorized solution for the latter loops.
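One way the loops might be vectorized is sketched below. It assumes that an offset of 3 to 7 between the start and end positions corresponds to the 4 to 8 character rule (s2(a:b) has b-a+1 chars), and it builds a full N-by-N offset matrix, so memory grows quadratically with the number of candidate positions. The names offset, startIdx and endIdx are introduced only for this illustration.

```matlab
s2 = 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba';
start_end_char = 'jqvfxgbdz';

% Positions of all admissible start/end chars, already in ascending order
sorted_list = find(ismember(s2, start_end_char));

% offset(i,j) = sorted_list(j) - sorted_list(i) for every pair of positions
offset = bsxfun(@minus, sorted_list, sorted_list.');

% Keep the pairs whose distance gives a string of 4 to 8 chars
[startIdx, endIdx] = find(offset >= 3 & offset <= 7);

% Cut the corresponding substrings out of s2
resulting_strings = arrayfun(@(a, b) s2(sorted_list(a):sorted_list(b)), ...
                             startIdx, endIdx, 'UniformOutput', false);
```

bsxfun is used instead of a plain subtraction so that the sketch does not depend on implicit expansion being available.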

find all possible element combinations containing the last element in matlab

I have a vector [x1, x2,...xn]. Is there a way to find all possible combinations of elements that contain the last element xn? For example, if I have 4 elements I want the combinations:
x1,x4
x2,x4
x3,x4
x1,x2,x3,x4
x1,x2,x4
x1,x3,x4
x2,x3,x4
In reality, though, I have up to a few hundred elements.
Thank you for your time!
You really just need to do a choose on all of the elements except the last one.
C = cell(length(x)-1,1);
for n = 1:length(x)-1
C{n} = nchoosek(x(1:end-1),n);
end
C{n} contains all possible combinations of n elements, one per row. All you have to do is tack on x(end) to each one to get what you're looking for. For example, if combo=C{4}(7,:) is one solved set without the last element of x, then your desired output is combo=[combo x(end)]. To do this for all solutions, just add this line of code inside the loop above:
C{n} = [C{n} x(end)*ones(size(C{n},1),1)];
WARNING: With thousands of elements you will run out of memory very quickly. Just 100 elements gives you over 6e29 possible combinations!
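To make the growth concrete: a vector with n elements has 2^(n-1) - 1 such combinations (every non-empty subset of the first n-1 elements, plus x(end)). For small n they can also be enumerated from the bits of a counter, as in this sketch; the sample vector and the names numCombos, combos and mask are assumptions, and the approach is only practical for small n.

```matlab
x = [10 20 30 40];               % hypothetical sample vector
n = numel(x);

numCombos = 2^(n-1) - 1;         % 7 for n = 4, matching the list in the question
combos = cell(numCombos, 1);
for m = 1:numCombos
    % Bit k of m decides whether x(k) is part of this combination
    mask = bitget(m, 1:n-1) == 1;
    combos{m} = [x(mask) x(end)];
end
```

For n = 100 the same formula gives 2^99 - 1, which is where the figure of more than 6e29 comes from.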