How to calculate possible word subsequences matching a pattern? - matlab

Suppose I have a sequence:
Seq = 'hello my name'
and a string:
Str = 'hello hello my friend, my awesome name is John, oh my god!'
I then look up each word of the sequence in the string and store the word indices of its matches in a cell array: the first cell holds the matches for 'hello', the second the matches for 'my', and the third the matches for 'name'.
Match = {[1 2]; %'hello' matches
[3 5 11]; %'my' matches
[7]} %'name' matches
I need code to somehow get an answer saying that possible sub-sequence matches are:
Answer = [1 3 7; %[hello my name]
1 5 7; %[hello my name]
2 3 7; %[hello my name]
2 5 7;] %[hello my name]
"Answer" should contain all possible ordered sequences. That is why 'my' (word 11) never appears in "Answer": there would have to be a 'name' match after position 11.
NOTE: The length and number of matches of "Seq" may vary.

Since the length of Match may vary, you need to use comma-separated lists, together with ndgrid, to generate all combinations (the approach is similar to that used in this other answer). Then filter out the combinations whose indices are not increasing, using diff and logical indexing:
cc = cell(1,numel(Match)); %// pre-shape to be used for ndgrid output
[cc{end:-1:1}] = ndgrid(Match{end:-1:1}); %// output is a comma-separated list
cc = cellfun(@(v) v(:), cc, 'uni', 0); %// linearize each cell
combs = [cc{:}]; %// concatenate into a matrix, one combination per row
ind = all(diff(combs,[],2)>0, 2); %// rows with strictly increasing indices (also correct for two-word sequences)
combs = combs(ind,:); %// remove unwanted combinations
The desired result is in the variable combs. In your example,
combs =
1 3 7
1 5 7
2 3 7
2 5 7
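For completeness, here is a minimal sketch of the whole pipeline starting from the raw strings. How Match is built is an assumption, since the question does not show it: words are taken to be runs of letters, so punctuation such as the comma in 'friend,' is ignored. The names seqWords and strWords are illustrative only.

```matlab
Seq = 'hello my name';
Str = 'hello hello my friend, my awesome name is John, oh my god!';

% Split both strings into words (assumption: a word is a run of letters)
seqWords = regexp(Seq, '[A-Za-z]+', 'match');
strWords = regexp(Str, '[A-Za-z]+', 'match');

% For each word of the sequence, find its word positions in the string
Match = cellfun(@(w) find(strcmp(strWords, w)), seqWords, 'UniformOutput', false);

% All combinations via ndgrid, then keep only strictly increasing rows
cc = cell(1, numel(Match));
[cc{end:-1:1}] = ndgrid(Match{end:-1:1});
cc = cellfun(@(v) v(:), cc, 'UniformOutput', false);
combs = [cc{:}];
combs = combs(all(diff(combs, [], 2) > 0, 2), :);
% combs is [1 3 7; 1 5 7; 2 3 7; 2 5 7]
```

Note that ndgrid called with a single input behaves like ndgrid(x,x), so a one-word sequence would need to be handled separately.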

Related

How to extract vectors of consecutive numbers?

Suppose that I have a Q vector which is defined as Q = [1 2 3 4 5 8 9 10 15]; and I would like to find a way to extract different vectors of consecutive numbers and also a vector for the rest of the elements. So my result would be like:
q1 = [1 2 3 4 5];
q2 = [8 9 10 ];
q3 = [15];
You can do this using diff, cumsum and accumarray:
q = accumarray(cumsum([1, diff(Q)~=1])', Q', [], @(x){x})
which returns:
{[1,2,3,4,5];
[8,9,10];
[15]}
i.e. q{1} gives you [1,2,3,4,5] etc., which is a far cleaner solution than having separately named vectors. But if you really want them, and you know exactly how many groups you will get, you can do it as follows:
[q1,q2,q3] = q{:};
Explanation:
accumarray will apply an aggregation function (4th input) to elements of a vector (2nd input) based on groupings specified in another vector (1st input).
To use the notation in the docs:
sub = cumsum([1, diff(Q)~=1])';
val = Q';
fun = @(x){x};
Note that sub needs to start from 1. The idea is to use the vectorized diff function to find elements that are consecutive (i.e. where Q(i+1) - Q(i) == 1). diff(Q)~=1 marks the breaks between groups of consecutive numbers, and the 1 concatenated at the beginning forces a break at the start. cumsum then converts these breaks into a vector of group numbers in the right form for sub, i.e.
sub = [1 1 1 1 1 2 2 2 3]
The aggregation function we specify just wraps each group in a cell, so accumarray returns a cell array with one cell per group.
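The intermediate steps can be sketched one at a time, which makes it easier to see how sub is built. The names d and isBreak are illustrative and do not appear in the original answer.

```matlab
Q = [1 2 3 4 5 8 9 10 15];

d = diff(Q);              % differences between neighbours: [1 1 1 1 3 1 1 5]
isBreak = [1, d ~= 1];    % 1 where a new group starts:     [1 0 0 0 0 1 0 0 1]
sub = cumsum(isBreak);    % running group number:           [1 1 1 1 1 2 2 2 3]

% Group the values of Q by sub; each group is wrapped in a cell
q = accumarray(sub', Q', [], @(x){x});
```

Because accumarray passes each group to the function as a column vector, q{2} is the column [8; 9; 10] rather than a row; use q{2}.' if a row is needed.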

Matlab: find second argmax [duplicate]

How do I find the index of the 2 maximum values of a 1D array in MATLAB? Mine is an array with a list of different scores, and I want to print the 2 highest scores.
You can use sort, as @LuisMendo suggested:
[B,I] = sort(array,'descend');
This gives you the sorted version of your array in the variable B and the indexes of the original position in I sorted from highest to lowest. Thus, B(1:2) gives you the highest two values and I(1:2) gives you their indices in your array.
I'll go for an O(k*n) solution, where k is the number of maximum values you're looking for, rather than O(n log n):
x = [3 2 5 4 7 3 2 6 4];
y = x; %// make a copy of x because we're going to modify it
[~, m(1)] = max(y);
y(m(1)) = -Inf;
[~, m(2)] = max(y);
m =
5 8
This is only practical if k is less than log n. In fact, if k>=3 I would put it in a loop, which may offend the sensibilities of some. ;)
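A minimal sketch of that loop version, assuming k = 3 and the same example vector. As in the code above, the copy y is overwritten with -Inf so that each pass finds the next largest value.

```matlab
x = [3 2 5 4 7 3 2 6 4];
k = 3;                    % number of largest values wanted (assumed here)

y = x;                    % work on a copy, since entries get overwritten
m = zeros(1, k);          % indices of the k largest values, largest first
for i = 1:k
    [~, m(i)] = max(y);   % position of the current maximum
    y(m(i)) = -Inf;       % exclude it from the next pass
end
% m is [5 8 3], and x(m) is [7 6 5]
```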
To get the indices of the two largest elements: use the second output of sort to get the sorted indices, and then pick the last two:
x = [3 2 5 4 7 3 2 6 4];
[~, ind] = sort(x);
result = ind(end-1:end);
In this case,
result =
8 5

Finding strings using an index - MATLAB

I have a character array list and wish to tally the number of substring occurrences against an index held in a numerical vector chr:
list =
CCNNCCCNNNCNNCN
chr =
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
Ordinarily, I am searching for adjacent string pairs, e.g. 'CC', and utilise this method:
Count(:,1) = accumarray(chr(intersect([strfind(list,'CC')],find(~diff(chr)))),1);
Using ~diff(chr) to ensure the pattern matching does not cross index boundaries.
However, now I want to match single letter strings i.e. 'N' - how can I accomplish this? The above method means the last letter in each index is missed and not counted.
The desired result for the above example would be a two column matrix detailing the number of 'C's and 'N's within each index:
C N
2 2
5 6
i.e. there are 2C's and 2N's within index '1' (stored in chr) - the count then restarts from 0 for the next '2' - where there are 5C's and 6N's.
[u, ~, v] = unique(list); %// get unique labels for list in variable v
result = full(sparse(chr, v, 1)); %// accumulate combinations of chr and v
This works for an arbitrary number of letters in list, an arbitrary number of indices in chr, and chr not necessarily sorted.
In your example
list = 'CCNNCCCNNNCNNCN';
chr = [1 1 1 1 2 2 2 2 2 2 2 2 2 2 2].';
which produces
result =
2 2
5 6
The letter associated with each column of result is given by u:
u =
CN
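If the sparse call feels indirect, the same tally can probably be written with accumarray and a two-column subscript matrix. This is only a sketch of an equivalent formulation, not part of the original answer; v(:) is used to force a column regardless of how unique orients its third output.

```matlab
list = 'CCNNCCCNNNCNNCN';
chr = [1 1 1 1 2 2 2 2 2 2 2 2 2 2 2].';

[u, ~, v] = unique(list);            % u = 'CN', v = letter number per character
result = accumarray([chr, v(:)], 1); % rows: index in chr, columns: letter in u
% result is [2 2; 5 6]
```

Each row of the subscript matrix [chr, v(:)] names one cell of result, and accumarray adds 1 to that cell for every character.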

splitting a Matrix into column vectors and storing it in an array

My question has two parts:
1. Split a given matrix into its columns
2. Store these columns in an array
eg,
A = [1 3 5
3 5 7
4 5 7
6 8 9]
Now, I know the solution to the first part:
the columns are obtained via
tempCol = A(:,iter), where iter = 1:end
Regarding the second part of the problem, I would like to have (something like this, maybe a different indexing into arraySplit array), but one full column of A should be stored at a single index in splitArray:
arraySplit(1) = A(:,1)
arraySplit(2) = A(:,2)
and so on...
for the example matrix A,
arraySplit(1) should give me [ 1 3 4 6 ]'
arraySplit(2) should give me [ 3 5 5 8 ]'
I am getting the following error, when i try to assign the column vector to my array.
In an assignment A(I) = B, the number of elements in B and I must be the same.
I am doing the allocation and access of arraySplit wrongly, please help me out ...
Really, it sounds like A is already what you want; I can't imagine a scenario where you gain anything by splitting the columns up. But if you do, then your best bet is likely a cell array, i.e.
C = cell(1,3);
for i=1:3
C{i} = A(:,i);
end
Edit: See @EitanT's comment below for a more elegant way to do this. Also, accessing the vector uses the same syntax as setting it, e.g. v = C{2}; will put the second column of A into v.
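The comment referred to is not reproduced here, so the following is only one loop-free possibility and not necessarily what the commenter had in mind. It relies on num2cell with a dimension argument, which packs each column of A into its own cell.

```matlab
A = [1 3 5;
     3 5 7;
     4 5 7;
     6 8 9];

C = num2cell(A, 1);   % 1-by-3 cell array, one 4-by-1 column per cell
v = C{2};             % second column of A: [3; 5; 5; 8]
```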
In a Matlab array, each element must have the same type. In most cases, that is a float type. In your example, A(:, 1) is a 4 by 1 array. If you assign it to, say, B(:, 2), then B(:, 1) must also be a 4 by 1 array.
One common error that may be biting you is that a 4 by 1 array and a 1 by 4 array are not the same thing. One is a column vector and one is a row vector. Try transposing A(:, 1) to get a 1 by 4 row array.
You could try something like the following:
A = [1 3 5;
3 5 7;
4 5 7;
6 8 9]
arraySplit = zeros(4,1,3);
for i =1:3
arraySplit(:,:,i) = A(:,i);
end
and then call arraySplit(:,:,1) to get the first vector, but that seems to be an unnecessary step, since you can readily do that by accessing the exact same values as A(:,1).

How to convert numeric results in symbols or strings?

this is my problem.
I made an algorithm that makes permutations of certain words. I substituted each word with a numeric value so I can make arithmetical operations with them (e.g. 1 = 'banana' 2 = 'child' 3 = 'car' 4 = 'tree' etc.).
Let's say that after running an algorithm, matlab gave me this matrix as result:
ans = [2,2,1; 4,3,3]
What I can never figure out is how to tell MATLAB to substitute the digits with the words and write:
ans = [child,child,banana; tree,car,car] - so I don't have to look up every number in my chart and replace it with a corresponding word!?
If you have an array with your words, and another array with the indices, you can produce an array that replaces every index with the corresponding word like so:
words = {'banana','child','car','tree'};
numbers = [2 2 1;4 3 3];
>> words(numbers)
ans =
'child' 'child' 'banana'
'tree' 'car' 'car'
You can also use the ordinal datatype if you have the statistics toolbox.
>> B = ordinal([2 2 0; 4 3 3], {'banana','child','car','tree'})
B =
child child banana
tree car car
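Note that the zero is handled automatically: the labels are assigned to the sorted unique values of the input (here 0, 2, 3, 4) rather than being used as indices. Then you can do things like: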
>> B=='child'
ans =
1 1 0
0 0 0
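If the Statistics Toolbox is not available, a similar result can probably be had with categorical, which is part of base MATLAB in releases that provide it. This is a sketch under that assumption, using the numbering from the question (1 = 'banana', 2 = 'child', and so on); isChild and asText are illustrative names.

```matlab
words = {'banana','child','car','tree'};
numbers = [2 2 1; 4 3 3];

% Map the values 1..4 to the corresponding words
B = categorical(numbers, 1:numel(words), words);

isChild = (B == 'child');   % logical matrix: [1 1 0; 0 0 0]
asText = cellstr(B);        % 2-by-3 cell array of character vectors
```

Here the value set 1:numel(words) is given explicitly, so each number maps to the word at that position in words.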