Search for an exact match in string - matlab

Given a table with the following format in MATLAB:
itemids keywords
1 3D,children,anim,pixar,3D,3D pixar
2 3D,4D pixar,3D car
... ...
I want to count the number of times each keyword is repeated in each item. The full list of unique keywords is available in keywords = {'3D';'Children';'anim';'pixar' ...}. The output is a matrix TF with as many rows as there are items and length(keywords) columns.
One of the difficulties here is finding an exact match for each string. I am currently using strcmp(), which seems to return every entry containing a given word rather than only exact matches. In my case I need to differentiate between 3D and 3D pixar.

This can be done using the ismember function in MATLAB. I am assuming that the keywords for each item are actually stored as a single string, in which case you will need to split that string before calling ismember.
relevantKeywords = {'3D','Children','anim','pixar'};
keywordsInItem = strtrim(strsplit(keywordsStr,',')); % Split the words and trim each word
tmp = ismember(relevantKeywords,keywordsInItem);
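tmp will be of size 1 x length(relevantKeywords), indicating whether each relevant keyword was found. Note that ismember only reports presence; to count occurrences of the k-th keyword you could use, for example, sum(strcmp(keywordsInItem, relevantKeywords{k})). Both functions are case-sensitive, so 'Children' will not match 'children'.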

Related

How to sort an array while keeping the order of the index row matching the sorted row?

The easiest way is to show you with Excel:
Unsorted:
Sorted:
This example uses Excel, but I need to do the same thing in MATLAB with thousands of entries (with 2 rows if possible).
Here is my code so far:
%At are random numbers between 0 and 2, 6000 entries.
[sorted]=sort(At);
max=sorted(end);
min=sorted(1);
%need the position of the min and max
But this is only 1 row that's being sorted and it has no numbers in the second row, and no index. How would I add one and keep it following my first row?
Thank you!
I don't have access to Matlab, but try
[sorted, I] = sort(At);
where I is the corresponding vector of indices into At, so that sorted equals At(I). See the MATLAB documentation for details.
You have a couple of options here. For the simple case where you just need the indices, the fourth form of sort listed in the docs already does this for you:
[sorted, indices] = sort(At);
In this case, At(indices) is the same as sorted.
If your "indices" are actually another distinct array, you can use sortrows:
toSort = [At(:) some_other_array(:)];
sorted = sortrows(toSort);
In this case sorted(:, 1) will be the sorted array from the first example and sorted(:, 2) will be the other array sorted according to At.
sortrows accepts a second parameter that specifies the column to sort by. This can be a single column or a list of columns, like in Excel. It can also return a second output argument, the indices, just like regular sort.

How do I determine where a word appears over multiple sentences?

I have the following code that tells me how many times a word appears in a sentence. Specifically, I have a 1D cell array of sentence strings and a 1D cell array of words that I want to search for within each sentence. The output is a 2D array where each row and column combination tells me how many times a particular word (column) appears in a sentence (row). In other words:
Out = NaN(numel(sentences), numel(out_words));
for i = 1:numel(out_words)
Out(:,i) = cellfun(@(x) numel(strfind(x, out_words{i})), sentences);
end
display(Out)
What I would like now is a 1D cell array Out where each element describes a word and within this element is a vector that tells you which sentences the word appears in.
For example, if the word is trees and trees is assigned an ID of 1, a potential vector within the cell element of Out{1} could be [1,5,8], which means that the word trees appeared in sentences 1, 5, and 8. Is there a way to do this in MATLAB?
I will reiterate what I asked in the comments so people on the StackOverflow community will know what the question is really asking:
The OP wants a 1D cell array Out where each element describes a word and within this element is a vector that tells you which sentences the word appears in.
For example, if the word is trees and trees is assigned an ID of 1, a potential vector within the cell element of Out{1} could be [1,5,8], which means that the word trees appeared in sentences 1, 5, and 8.
One easy way to do this would be to loop over every word that you have and use strfind and see which elements in the output are non-empty. For those locations that are non-empty, this would determine where the word has occurred in a particular sentence. Let's do an example. I'm going to choose six sentences to be the first six lyrics of Bruce Cockburn's Lovers in a Dangerous Time. I'll declare this to be in a cell array of strings called sentences. To be sure we can find words correctly and not worry about case, we will convert all sentences to lower case with lower:
sentences = {'Don''t the hours grow shorter as the days go by',
'You never get to stop and open your eyes'
'One day you''re waiting for the sky to fall'
'The next you''re dazzled by the beauty of it all'
'When you''re lovers in a dangerous time'
'Lovers in a dangerous time'};
sentences = lower(sentences);
Next, I'm going to declare a words array that determines which words I want to find over all of the sentences:
words = {'you', 'one', 'dangerous', 'beauty', 'go', 'to'};
As such, the code you want is very simply:
Out = cell(numel(words), 1); %// Declare empty array of cells for each word
for idx = 1 : numel(words) %// For each word...
K = strfind(sentences, words{idx}); %// See which sentences have these words
ind = cellfun('isempty', K); %// Determine which locations are EMPTY
Out{idx} = find(~ind); %// To find those locations that are non-empty, we need to find those entries that are 0, so search for the inverse
end
Let's go through the above code slowly. We first declare a cell array of elements (1D) that is as long as the total number of words we have. Next, for each word, we use strfind to determine whether we can find that particular word in all of the sentences. strfind will return a cell array where each element in this array tells you the starting index (or indices if there is more than one occurrence) of where we have found this word. If an element in this cell array is empty, this means that we did not find the word in the sentence.
Now, what we're going to do next is search within this cell array for any entries that are empty. This can be done with cellfun and the output will be a logical vector where 1 means it's empty and 0 means it's non-empty. We want to find those locations that are non-empty, and so we use find to search for locations that are non-empty. These locations ultimately determine whether we have found that word in that sentence.
As such, if we run with the above example, this is what we get:
>> celldisp(Out)
Out{1} =
2
3
4
5
Out{2} =
3
Out{3} =
5
6
Out{4} =
4
Out{5} =
1
Out{6} =
2
3
This means that the first word, you, was found in sentences #2 through #5. strfind matches substrings rather than whole words, so besides 'you never get to stop and open your eyes' it also hits the you inside you're in sentences #3, #4 and #5. Next, the second word is one, and we have found this in sentence #3, which reads: 'one day you're waiting for the sky to fall'. The next word is dangerous, and we have found this in sentences #5 and #6, which read: 'when you're lovers in a dangerous time' and 'lovers in a dangerous time'. You can follow along with the rest of the cell array and verify that each cell element gives you the IDs of the sentences in which that word appeared.
In your comments, you want to go further and make an associative array where you specify the string you want and the output will be the sentence IDs of the sentences that contain these words. You can use a containers.Map class to do that for you. Specifically:
out_dict = containers.Map(words, Out);
Now, you can do:
>> out_dict('dangerous')
ans =
5
6
If you want to input multiple words, use the values method:
vals = values(out_dict, {'you', 'dangerous', 'go'})
celldisp(vals)
vals{1} =
2
3
4
5
vals{2} =
5
6
vals{3} =
1
If you want to display all of the words, just do:
vals = values(out_dict, words);
celldisp(vals)
vals{1} =
2
3
4
5
vals{2} =
3
vals{3} =
5
6
vals{4} =
4
vals{5} =
1
vals{6} =
2
3
BTW, I will reiterate this for you. Please consider accepting the answers that were provided to you in your previous questions. This signifies to the StackOverflow community that you no longer need help for your questions. Since you like reading answers, you can read this to help you figure out how to accept answers: How to update and accept answers
Good luck!

Generate a random String in MatLab

I am trying to generate an array of strings from a long predefined array of chars, as follows.
If I have the following long string:
s= 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba'
I want to create a group of random strings based on the following rules
the string should be between 4 and 12 chars
it should end or start with one of the following chars {j,q,v,f,x,g,b,d,z}
Here is a solution that returns all strings which fulfill the following rules:
The starting and ending chars have to come from the string:
start_end_char= 'jqvfxgbdz';
The length has to be between 4 and 8 chars
The string has to be sequentially correct, meaning the resulting strings have to appear in exactly the same way in the "long" string
So what am I doing?
First of all, I find all the positions where the predefined starting and ending chars appear in the main string (careful: I used s2 instead of s as the string name).
Then I get a sorted list of those positions (sorted_list).
Next, for each element I get a list of indices of acceptable ending chars (following rules 1 and 2 stated above). These are saved in helper, which has to be a cell array because the lists differ in length.
Last but not least, I construct all those strings and save them in resulting_strings, which also has to be a cell array.
s2= 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba';
start_end_char= 'jqvfxgbdz';
length_start = length(start_end_char);
%%finding all positions of possible starting/ending points
position_char= cell(1,length_start);
for k=1:length_start
position_char{k}=find(s2==start_end_char(k));
end
list_of_start_end_points=[];
%% getting an array with all starting/ending points in the given array
for k=1:length_start
list_of_start_end_points= horzcat(list_of_start_end_points,position_char{k});
end
sorted_list= sort(list_of_start_end_points);
%% getting possible combinations
helper = cell(1, length(sorted_list));
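length_helper = 0; % running total of strings to build
for k=1:length(sorted_list)
% an index difference of 3..7 corresponds to a string length of 4..8
helper{k}=find(and(sorted_list-sorted_list(k)>=3,sorted_list-sorted_list(k)<=7));
length_helper = length_helper + length(helper{k});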
end
resulting_strings = cell(1, length_helper);
l=1;
for k=1:length(sorted_list)
for m=1:length(helper{k})
resulting_strings{1,l} = s2(sorted_list(k):sorted_list(helper{k}(m)));
l=l+1;
end
end
This solution uses quite a few loops. The first two loops are negligible (their iteration count is the number of acceptable start/end letters), but the latter two can be quite time-consuming if the original string is much longer. So maybe someone will find a vectorized solution for the latter loops.

find all possible element combinations containing the last element in matlab

I have a vector [x1, x2,...xn]. Is there a way to find all possible combinations of elements that contain the last element xn? For example, if I have 4 elements I want the combinations:
x1,x4
x2,x4
x3,x4
x1,x2,x3,x4
x1,x2,x4
x1,x3,x4
x2,x3,x4
In reality though I have number of elements up to a few hundreds.
Thank you for your time!
You really just need to do a choose on all of the elements except the last one.
C = cell(length(x)-1,1);
for n = 1:length(x)-1
C{n} = nchoosek(x(1:end-1),n);
end
Each element of C contains all possible vectors with n elements. All you have to do is tack on x(end) to each one to get what you're looking for. For example, if combo = C{4}(7,:) is one solved set without the last element of x, then your desired output is combo = [combo x(end)]. To do this for all solutions, just add this line of code inside the loop above:
C{n} = [C{n} x(end)*ones(size(C{n},1),1)];
WARNING: With thousands of elements you will run out of memory very quickly. Just 100 elements gives you over 6e29 possible combinations!

Working with the output from recfromcsv

I'm porting a Matlab script to Python. Below is an extract:
%// Create a list of unique trade dates
DateList = unique(AllData(:,1));
%// Loop through the dates
for DateIndex = 1:size(DateList,1)
CalibrationDate = DateList(DateIndex);
%// Extract the data for a single calibration date (but all expiries)
SubsetIndices = ismember(AllData(:,1) , DateList(DateIndex)) == 1;
SubsetAllExpiries = AllData(SubsetIndices, :);
AllData is an N-by-6 cell matrix, the first 2 columns are dates (strings) and the other 4 are numbers. In python I will be getting this data out of a csv so something like this:
import numpy as np
AllData = np.recfromcsv(open("MyCSV.csv", "rb"))
So now, if I'm not mistaken, AllData is a numpy array of ordinary tuples. Is this the best format to have this data in? The goal is to extract a list of unique dates from column 1 and, for each date, extract the rows with that date in column 1 (column one is ordered). Then, for each of those rows, do some maths on the numbers and date in the remaining 5 columns.
So in matlab I can get the list of dates by unique(AllData(:,1)) and then I can get the records (rows) corresponding to that date (i.e. with that date in columns one) like this:
SubsetIndices = ismember(AllData(:,1) , MyDate) == 1;
SubsetAllExpiries = AllData(SubsetIndices, :);
How can I best achieve the same results in Python?
To put things in context, np.recfromcsv is just a modified version of np.genfromtxt which outputs record arrays instead of structured arrays.
A structured array lets you access the individual fields (here, your columns) by their names, like in my_array["field_one"] while a record array gives you the same plus the possibility to access the fields as attributes, like in my_array.field_one. I'm not fond of "access-as-attributes", so I usually stick to structured arrays.
For your information, structured/record arrays are not arrays of tuples, but arrays of a numpy object called np.void: each element is a block of memory composed of as many sub-blocks as you have fields, the size of each sub-block depending on its datatype.
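As a rough illustration of the difference described above, here is a minimal sketch. The field names trade_date and price and the sample values are made up for the example, and the "U10" string type is an assumption chosen so the values stay ordinary strings under Python 3.

```python
import numpy as np

# Hypothetical two-field table; field names and values are illustrative only.
structured = np.array(
    [("2012-01-03", 1.5), ("2012-01-04", 2.5)],
    dtype=[("trade_date", "U10"), ("price", float)],
)

# A structured array exposes its fields by name.
prices_by_name = structured["price"]

# Viewing the same data as a record array adds attribute-style access.
records = structured.view(np.recarray)
prices_by_attr = records.price

# A single element of the structured array is a np.void scalar, not a tuple.
first_row = structured[0]

print(prices_by_name)
print(isinstance(first_row, np.void), isinstance(first_row, tuple))
```

Both access styles return the same underlying column, so the choice between them is mostly a matter of taste.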
That said, yes, what you seem to have in mind is exactly the kind of usage for a structured array. The approach would then be:
to take your dates array and filter them to find the unique elements.
to find the indices of these unique elements, as an array of integers we'll call, say, matching;
to use matching to access the corresponding records (e.g., rows of your array) using fancy indexing, as my_array[matching].
to perform your computations on the records, as you want.
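The steps above could be sketched roughly as follows. The small in-memory array stands in for the CSV contents, and the field names trade_date and price, as well as the mean-price computation, are assumptions made only for the example.

```python
import numpy as np

# Hypothetical stand-in for the CSV contents; field names are assumptions.
my_array = np.array(
    [("2012-01-03", 10.0), ("2012-01-03", 11.0), ("2012-01-04", 12.0)],
    dtype=[("trade_date", "U10"), ("price", float)],
)

# Step 1: unique dates (np.unique also returns them sorted).
unique_dates = np.unique(my_array["trade_date"])

subsets = {}
for date in unique_dates:
    # Step 2: integer indices of the records carrying this date.
    matching = np.flatnonzero(my_array["trade_date"] == date)
    # Step 3: fancy indexing pulls out the corresponding records.
    subsets[str(date)] = my_array[matching]

# Step 4: any per-date computation; here simply the mean price.
mean_prices = {d: float(rows["price"].mean()) for d, rows in subsets.items()}
print(mean_prices)
```

A boolean mask (my_array[my_array["trade_date"] == date]) would select the same rows; the integer indices are used here only to mirror the description.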
Note that you can keep your dates as strings or transform them into datetime objects using a user-defined converter, as described in the documentation. For example, you could transform a YYYY-MM-DD string into a datetime object with lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"). That way, instead of having, say, an N array where each row (a record) consists of two dates as strings and 4 floats, you would have an N array where each row consists of two datetime objects and 4 floats.
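To make the converter idea concrete, here is a small sketch that applies such a lambda to a plain list of strings. The sample dates are made up. In practice the same callable would be handed to the converters argument of np.genfromtxt, which, depending on the numpy version and encoding settings, may pass bytes rather than str to it; that part is left out here.

```python
import datetime

# Converter along the lines described above: "YYYY-MM-DD" -> datetime object.
to_datetime = lambda s: datetime.datetime.strptime(s, "%Y-%m-%d")

# Hypothetical date strings standing in for one CSV column.
raw_dates = ["2012-01-03", "2012-02-15"]
parsed = [to_datetime(s) for s in raw_dates]

# datetime objects support arithmetic that plain strings do not.
gap_days = (parsed[1] - parsed[0]).days
print(parsed[0], gap_days)
```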
Note the shape of your array (via my_array.shape): it is (N,), meaning it's a 1D array, even if it looks like a 2D table with multiple columns. You can access individual fields (each "column") by name. For example, if we create an array consisting of one string field called first and one int field called second, like this:
x = np.array([('a',1),('b',2)], dtype=[('first',"|S10"),('second',int)])
you could access the first column with
>>> x['first']
array(['a', 'b'],
dtype='|S10')
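As a quick check of the shape remark, the following sketch rebuilds a similar array and inspects it. It swaps the "|S10" byte-string field for a unicode "U10" field so that the values stay ordinary strings under Python 3; that swap is an assumption made for this example only.

```python
import numpy as np

# Same layout as the example above, but with a unicode field ("U10")
# instead of "|S10"; this change is an assumption for Python 3.
x = np.array([("a", 1), ("b", 2)], dtype=[("first", "U10"), ("second", int)])

print(x.shape)        # a single axis: one entry per record
print(x.dtype.names)  # the field ("column") names
print(x["second"])    # the second field as a plain integer array
```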