Finding strings using an index - MATLAB

I have a character array list and wish to tally the number of substring occurrences against an index held in a numerical vector chr:
list =
CCNNCCCNNNCNNCN
chr =
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
Ordinarily I search for adjacent pairs of letters, e.g. 'CC' or 'NN', using this method:
Count(:,1) = accumarray(chr(intersect([strfind(list,'CC')],find(~diff(chr)))),1);
Using ~diff(chr) ensures that the pattern matching does not cross index boundaries.
However, I now want to match single-letter strings, e.g. 'N'. How can I accomplish this? With the above method, the last letter in each index is missed and not counted.
The desired result for the above example would be a two column matrix detailing the number of 'C's and 'N's within each index:
C N
2 2
5 6
i.e. there are 2 C's and 2 N's within index 1 (stored in chr); the count then restarts from 0 for index 2, where there are 5 C's and 6 N's.

[u, ~, v] = unique(list); %// get unique labels for list in variable v
result = full(sparse(chr, v, 1)); %// accumulate combinations of chr and v
This works for an arbitrary number of letters in list, an arbitrary number of indices in chr, and chr not necessarily sorted.
In your example
list = 'CCNNCCCNNNCNNCN';
chr = [1 1 1 1 2 2 2 2 2 2 2 2 2 2 2].';
which produces
result =
2 2
5 6
The letter associated with each column of result is given by u:
u =
CN
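The same tally can also be sketched with accumarray, which the question already uses. This is only an illustration on the example data; counts is a name introduced here and is not part of the answer above.

```matlab
% Example data from the question
list = 'CCNNCCCNNNCNNCN';
chr  = [1 1 1 1 2 2 2 2 2 2 2 2 2 2 2].';

% v(k) is the column (letter) label of list(k); u holds the distinct letters
[u, ~, v] = unique(list);

% Each (chr(k), v(k)) pair adds 1 to the cell in row chr(k), column v(k)
counts = accumarray([chr(:), v(:)], 1);
```

For this data counts is [2 2; 5 6] and u is 'CN', matching the result above.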

Related

Can I get 2 set of random number array in matlab?

idx=randperm(5)
idx=[1,3,4,2,5]
I know randperm works like this, but I'm curious whether there is a way to get something like the following:
idx=[1,3,4,2,5,5,3,2,4,1]
That is, one random permutation appended after another.
Is there any way to do that?
One vectorized way would be to create a random array of size (m,n), sort it along each row and get the argsort indices. Each row of those indices then represents one group of randperm values. Here, m is the number of groups needed and n is the number of elements in each group.
Thus, the implementation would look something like this -
[~,idx] = sort(rand(2,5),2);
out = reshape(idx.',1,[])
Sample run -
>> [~,idx] = sort(rand(2,5),2);
>> idx
idx =
5 1 3 2 4
4 3 2 5 1
>> out = reshape(idx.',1,[])
out =
5 1 3 2 4 4 3 2 5 1
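A slightly more general sketch of the same idea, with the group count and group length pulled out as variables and a sanity check added. The names m, n, blocks and isPerm are chosen here for illustration only.

```matlab
m = 2;   % number of permutations to concatenate
n = 5;   % length of each permutation

% Sorting each row of random numbers yields one random permutation per row
[~, idx] = sort(rand(m, n), 2);
out = reshape(idx.', 1, []);          % lay the rows end to end

% Sanity check: every block of n elements contains 1..n exactly once
blocks = reshape(out, n, m).';
isPerm = isequal(sort(blocks, 2), repmat(1:n, m, 1));
```

isPerm should always come out true, since each row of idx is a permutation of 1:n by construction.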
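You can use the modulo operation. Note that this shuffles all r*n values together, so each value appears r times but the result is not necessarily made of r back-to-back permutations:
n = 5 % maximum value
r = 2 % each element is repeated r times
res = mod(randperm(r*n),n)+1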
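A small check of what the modulo version does and does not guarantee. The names counts and firstBlockIsPerm are introduced here purely for illustration.

```matlab
n = 5;   % maximum value
r = 2;   % number of times each value appears
res = mod(randperm(r*n), n) + 1;

% Each value 1..n occurs exactly r times overall
counts = accumarray(res(:), 1, [n 1]);

% The first n elements may or may not form a permutation of 1..n
firstBlockIsPerm = isequal(sort(res(1:n)), 1:n);
```

counts is always r for every value, whereas firstBlockIsPerm varies from run to run. If each block of n must be its own permutation, the sort-based approach is the safer choice.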

Using a matrix as a index to perform functions on another - MATLAB

I have two matrices in this form:
ind=
1
1
1
1
2
2
2
2
2
3
3
type =
A
A
B
A
A
B
A
B
A
B
A
I want to be able to identify pairs of a specific kind i.e. A-B and B-A but not A-A. I have been able to do this using IF statements in this form:
if strcmp(type(m),'A') == 1 && strcmp(type(m+1),'B') == 1 && ind(m) == ind(m+1)
And so forth.
As hinted at in this IF statement, I need to be able to calculate how many valid pairs there are per index.
For example, the first four types AABA belong to index '1' because index '1' has a length of 4 as specified in ind. Here there are two valid pairs A-B and B-A. A-A is not a valid pair.
The desired output for the full above example would be:
2
4
1
Is there a quick and easy way to accomplish this?
EDIT:
If the types were expanded to include 'C' - and the system needs to detect non-unique pairs i.e. A-B, B-A but also B-B (but nothing containing C) - could this be done? Is there a way to specify which pairs are being counted each time?
You can try:
ind = [1 1 1 1 2 2 2 2 2 3 3]';
type = 'AABAABABABA';
accumarray(ind(intersect([strfind(type,'AB'),strfind(type,'BA')],find(~diff(ind)))),1)
output:
ans =
2
4
1
If I recall correctly, arrayfun is actually kind of slow. I don't think it actually vectorizes the code. Anyway, the idea is to find 'AB' and 'BA' with strfind and then merge the indices together. However, you cannot count 'AB' and 'BA' across index boundaries, so intersect with find(~diff(ind)) will make sure only valid indices are kept. Then accumarray will accumulate all the indices together with ind for the answer you want.
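The one-liner can be unpacked into steps, which also makes it easy to choose which pairs are counted. This is a sketch on the example data; pairs, pos, sameIndex, valid and counts are names introduced here, and pairs is a hypothetical list that could be extended (for instance with 'BB').

```matlab
ind  = [1 1 1 1 2 2 2 2 2 3 3]';
type = 'AABAABABABA';
pairs = {'AB', 'BA'};   % hypothetical list of pairs to count; edit as needed

% Start positions of every listed pair
pos = [];
for k = 1:numel(pairs)
    pos = [pos, strfind(type, pairs{k})];
end

% Positions p where ind(p) == ind(p+1), i.e. the pair stays inside one index
sameIndex = find(~diff(ind));

% Keep only pairs that do not straddle an index boundary
valid = intersect(pos, sameIndex);

% Tally the surviving pairs by the index they start in
counts = accumarray(ind(valid), 1, [max(ind) 1]);
```

With these inputs counts is [2; 4; 1], the same as the one-liner. Passing the size [max(ind) 1] to accumarray keeps a row for an index even if it has no valid pairs.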
Try this:
>> arrayfun(@(x) sum(diff(type(ind == x)) ~= 0), unique(ind))
ans =
2
4
1

comparing rows of matrix and constructing 1D array in matlab?

I have a [sentences*words] matrix in which the rows are labeled with sentences and the columns with words. The code I used to build it is:
Out = NaN(numel(sentences), numel(out_words));
for i = 1:numel(out_words)
    Out(:,i) = cellfun(@(x) numel(strfind(x, out_words{i})), sentences);
end
display(Out)
The above code returns a matrix like the following sample (effectively a logical matrix when each word occurs at most once per sentence):
1 0 1
1 1 0
0 1 1
1 0 1
In the above, rows are sentences and columns are words: if a word is present in a sentence, 1 is written, otherwise 0.
Now I want to compare the rows and save all the locations where both rows have a 1. Row 1 should be compared with all the remaining rows, row 2 with all the rows after it, and so on up to the nth row. The result should be saved in a 1D array, for example:
output=
sentence {1,2} contain red
sentence {1,4} contain red,say
sentence {2,3} contain but
sentence {1,3} contain say
and so on for all n sentences.
In sentence {1,2}, 1 refers to sentence 1 and 2 to sentence 2, and so on up to the nth sentence. I want to compare the rows and pick the locations at which both rows have a 1 (true) value.
If someone can suggest a better way of implementing this equality relation for matrices, please let me know. Thank you.
You can use bsxfun to compare the sentences. Let M be the logical matrix of size #sentences-by-#words, then
cmp = bsxfun( @and, permute(M,[1 3 2]), permute(M,[3 1 2]) )
Now you have a logical array cmp of size #sentences-by-#sentences-by-#words where the vector v_ij = cmp( ii, jj, : ) has v_ij(k) = true iff sentence ii and sentence jj both have word k in them. (@and is used rather than @eq because @eq would also be true when neither sentence contains the word.)
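As a rough sketch of how such an array could be turned into the listing the question asks for, the following uses the 4-by-3 sample matrix from the question. It uses @and, so an entry is true only when both sentences contain the word. The column labels in words are an assumption inferred from the question's sample output, and the loop variables are illustrative.

```matlab
M = logical([1 0 1; 1 1 0; 0 1 1; 1 0 1]);   % sample matrix from the question
words = {'red', 'but', 'say'};               % assumed column labels

% cmp(ii,jj,k) is true when sentences ii and jj both contain word k
cmp = bsxfun(@and, permute(M, [1 3 2]), permute(M, [3 1 2]));

nS = size(M, 1);
for ii = 1:nS-1
    for jj = ii+1:nS
        % Row-vector mask of the words shared by this pair of sentences
        shared = words(reshape(cmp(ii, jj, :), 1, []));
        if ~isempty(shared)
            fprintf('sentence {%d,%d} contain %s\n', ii, jj, strjoin(shared, ','));
        end
    end
end
```

Only pairs with ii < jj are visited, because cmp is symmetric in its first two dimensions.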

How to restructure histcounts for using with a 2d-matrix

I have a 250000x2 matrix in MATLAB, where the first column holds a degree (int, 0-360°) and the second a float value corresponding to that degree. My goal is to count each occurrence of a degree-value pair (i.e. a row) and write the result to an nx3 matrix, where n is the number of unique rows.
My first step was to get all unique rows (using unique(M, 'rows')), which works. But now I want to count how often each unique row occurs. I tried the following approach:
uniqu_val = unique(values, 'rows');
instance = histcounts(values(:), uniqu_val);
Here I have to pass a vector as the second argument, not a matrix (uniqu_val is an nx2 matrix). But I want the number of occurrences of each unique row, so I cannot use just one column of uniqu_val. In short: I want to use histcounts with a 2D matrix as the edge values, not only a 1D vector. How can I solve this problem?
You can use the third output from unique and then use histcounts like so -
%// Find the unique rows and keep the order with 'stable' option
[uniq_val,~,row_labels] = unique(values, 'rows','stable')
%// Find the counts/instances
instances = histcounts(row_labels, max(row_labels))
%// OR with HISTC: instances = histc(row_labels, 1:max(row_labels))
%// Output the unique rows along with the counts
out = [uniq_val instances(:)]
Sample run -
>> values
values =
2 1
3 1
2 3
3 3
1 2
3 3
1 3
3 1
3 2
1 2
>> out
out =
2 1 1
3 1 2
2 3 1
3 3 2
1 2 2
1 3 1
3 2 1
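As an alternative sketch, accumarray can tally the row labels directly, which avoids depending on how histcounts places its bin edges when only a bin count is supplied. The data is the sample from the run above.

```matlab
values = [2 1; 3 1; 2 3; 3 3; 1 2; 3 3; 1 3; 3 1; 3 2; 1 2];

% row_labels(k) is the position of values(k,:) within uniq_val
[uniq_val, ~, row_labels] = unique(values, 'rows', 'stable');

% Add 1 to the slot of each label, giving one count per unique row
instances = accumarray(row_labels(:), 1);

% Unique rows with their counts in the third column
out = [uniq_val instances];
```

For this input out matches the sample output above.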

Filtering sequences in MATLAB

Is it possible to do something like regular expressions with MATLAB to filter things out? Basically I'm looking for something that will let me take a vector like:
[1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1]
and will return:
[3 3 3 3 4 4 4]
These are the uninterrupted sequences (where there's no interspersion).
Is this possible?
Using regular expressions
Use MATLAB's built-in regexp function for regular expression matching. However, you have to convert the input array to a string first, and only then feed it to regexp:
C = regexp(sprintf('%d ', x), '(.+ )(\1)+', 'match')
Note that I separated the values with spaces so that regexp can match multiple digit numbers as well. Then convert the result back to a numerical array:
res = str2num([C{:}])
The dot (.) in the pattern string represents any character. To find sequences of certain digits only, specify them in brackets ([]). For instance, the pattern to find only sequences of 3 and 4 would be:
([34]+ )(\1)+
A simpler approach
You can filter out successively repeating values by checking the similarity between adjacent elements using diff:
res = x((diff([NaN; x(:)])' == 0) | (diff([x(:); NaN])' == 0))
Optionally, you can keep only certain values from the result, for example:
res(res == 3 | res == 4)
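Applied to the vector from the question, the two steps look like this. The names sameAsPrev, sameAsNext and res34 are introduced here for readability.

```matlab
x = [1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1];

% True where an element equals its left or its right neighbour
sameAsPrev = diff([NaN; x(:)])' == 0;
sameAsNext = diff([x(:); NaN])' == 0;
res = x(sameAsPrev | sameAsNext);

% Optional second step: keep only selected values
res34 = res(res == 3 | res == 4);
```

Here res is [1 1 1 3 3 3 3 1 1 4 4 4 1 1]: the repeated 1s survive the first step as well, which is why the second filter is needed to arrive at [3 3 3 3 4 4 4].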
You can do it like this:
v=[1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1];
vals=unique(v); % find all unique values in the vector
mask=[]; % mask of values to retain
for i=1:length(vals)
    indices=find(v==vals(i)); % find indices of each unique value
    % if the maximum difference between indices containing
    % a given value is 1, it is contiguous
    % --> add this value to the mask
    if max(indices(2:end)-indices(1:end-1))==1
        mask=[mask vals(i)];
    end
end
% filter out what's necessary
vproc=v(ismember(v,mask))
Result:
vproc =
3 3 3 3 4 4 4
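A loop-free sketch of a similar idea counts how many separate runs each value forms and keeps the values that form exactly one. All variable names below are illustrative. One behavioural difference is assumed and worth noting: a value that occurs only once forms a single run and would be kept here, whereas the loop above drops it.

```matlab
v = [1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1];

isStart = [true, diff(v) ~= 0];      % marks the first element of each run
runVals = v(isStart);                % value carried by each run

[vals, ~, k] = unique(runVals);      % k maps each run to its value
nRuns = accumarray(k(:), 1);         % number of separate runs per value

keep   = vals(nRuns == 1);           % values that occur in exactly one run
vproc2 = v(ismember(v, keep));
```

For this input keep is [3 4] and vproc2 is [3 3 3 3 4 4 4].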
This is another approach, although a somewhat elaborate one.
If you plot your array, you want to retain the level sets (i.e. a == const) that are topologically connected (i.e. made of one piece).
For your example, such level sets are exactly the ones corresponding to a==3 and a==4.
Here is a possible implementation:
a = [1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1]
r = []; % result
b = a; % b will contain the union of the level sets not parsed yet
while ~isempty(b)
    m = a == b(1); % m is the current level set
    % a condition for being a connected set:
    % the derivative must have 1 positive and 1 negative jump
    connected = sum(diff([0 m 0]).^2) == 2;
    if connected % if the level set is connected we add it to the result
        r = [r a(m)];
    end
    b = b(b~=b(1));
end
If you try something like
a = [1 2 1 1 1 2 1 3 3 3 3 1 1 4 4 4 1 1] %initial vector
b = a>=3 %apply filter condition
a = a(b) %keep values that satisfy filter
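a will then contain the following (this works only because the uninterrupted values in this example happen to be the ones >= 3):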
a = [3 3 3 3 4 4 4]