string "cross correlation" in matlab - matlab

Assume that I have 2 strings of characters:
AACCCGGAAATTTGGAATTTTCCCCAAATACG
CGATGATCGATGAATTTTAGCGGATACGATTC
I want to find by how much I should move the second string such that it matches the first one the most.
There are 2 cases. The first one is that we assume that the strings are wrapped around, and the second one is that we don't.
Is there a matlab function that returns either an N array or a 2N+1 array of values for how much the shifted string 2 correlates with string 1?
If not, is there a faster/simpler method than something like
n = length(str1);
result = zeros(n, 1);
for i = 0:n-1
    result(i+1) = sum(str1 == circshift(str2, i));
end

You can convert each char into a binary column of size 4:
A -> [1;0;0;0]
C -> [0;1;0;0]
G -> [0;0;1;0]
T -> [0;0;0;1]
As a result a string of length n becomes a binary matrix of size 4-by-n.
You can now cross-correlate (along the X axis only) the two 4-by-n and 4-by-m matrices to get your result.
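A minimal sketch of this idea for the wrapped-around case, assuming both strings have the same length and contain only A, C, G and T. The variable names (alphabet, A, B, counts) are illustrative, and the FFT is just one possible way to evaluate the circular cross-correlation, not necessarily what was meant above:

```matlab
str1 = 'AACCCGGAAATTTGGAATTTTCCCCAAATACG';
str2 = 'CGATGATCGATGAATTTTAGCGGATACGATTC';
alphabet = 'ACGT';

% One-hot encode: row c is 1 wherever the string equals alphabet(c)
A = double(bsxfun(@eq, alphabet', str1));   % 4-by-n
B = double(bsxfun(@eq, alphabet', str2));   % 4-by-n

% Circular cross-correlation along the columns, one row per letter,
% then summed over the four letters
R = ifft(fft(A, [], 2) .* conj(fft(B, [], 2)), [], 2);
counts = round(real(sum(R, 1)));            % 1-by-n

% counts(k+1) is the number of matches when str2 is rotated right by k
[score, idx] = max(counts);
shift = idx - 1;
```

Here counts(k+1) should agree with sum(str1 == circshift(str2, k)) from the loop in the question, at O(n log n) cost instead of O(n²).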

With a hat tip to John d'Errico:
str1 = 'CGATGATCGATGAATTTTAGCGGATACGATTC';
str2 = 'AACCCGGAAATTTGGAATTTTCCCCAAATACG';
% the circulant matrix
n = length(str2);
C = str2( mod(bsxfun(@plus,(0:n-1)',0:n-1),n)+1 ); %//'
% Find the maximum number of matching characters, and the amount
% by which to shift the string to achieve this result.
% shift is the row index into C; row k holds str2 rotated left by k-1 characters
[score, shift] = max( sum(bsxfun(@eq, str1, C), 2) );
Faster yes, simpler...well, I'll leave that up to you to decide :)
Note that this method trades memory for speed. That is, it creates the matrix of all possible shifts in memory (efficiently), and compares the string to all rows of this matrix. That matrix will contain N² elements, so if N becomes large, it's better to use the loop (or Shai's method).
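To see what the matrix of all shifts looks like, here is a small illustration with a made-up 4-character string (the variable s is only an example, not part of the answer above):

```matlab
s = 'ACGT';
n = length(s);

% Row k of C is s rotated left by k-1 characters
C = s( mod(bsxfun(@plus, (0:n-1)', 0:n-1), n) + 1 );
disp(C)
```

This displays the rows ACGT, CGTA, GTAC and TACG, which is why the matrix needs N² characters of memory for a string of length N.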

Related

MATLAB function that gives all the positive integers in a column vector

I need to create a function that has the input argument n, an integer with n > 1, and an output argument v, which is a column vector of length n containing all the positive integers smaller than or equal to n, arranged in such a way that no element of the vector equals its own index.
I know how to define the function
This is what I tried so far but it doesn't work
function[v]=int_col(n)
[1,n] = size(n);
k=1:n;
v=n(1:n);
v=k'
end
Let's take a look at what you have:
[1,n] = size(n);
This line doesn't make a lot of sense: n is an integer, which means that size(n) will give you [1,1], you don't need that. (Also an expression like [1,n] can't be on the left hand side of an assignment.) Drop that line. It's useless.
k=1:n;
That line is pretty good, k is now a row vector of size n containing the integers from 1 to n.
v=n(1:n);
This doesn't make sense. n isn't a vector (or you can say it's a 1x1 vector). Either way, indexing into it (that's what the parentheses do) with 1:n doesn't work, because it only has one element. Drop that line too.
v=k'
That's also a nice line. It makes a column vector v out of your row vector k. The only thing that this doesn't satisfy is the "arranged in such a way that no element of the vector equals its own index" part, since right now every element equals its own index. So now you need to find a way to either shift those elements or shuffle them around in some way that satisfies this condition and you'd be done.
Let's give a working solution. You should really look into it and see how this thing works. It's important to solve the problem in smaller steps and to know what the code is doing.
function [v] = int_col(n)
if n <= 1
error('argument must be >1')
end
v = 1:n; % generate a row-vector of 1 to n
v = v'; % make it a column vector
v = circshift(v,1); % shift all elements by 1
end
This is the result:
>> int_col(5)
ans =
5
1
2
3
4
Instead of using circshift you can do the following as well:
v = [v(end);v(1:end-1)];
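A quick way to check the requirement is to compare the vector against its own indices. This is only a sketch; isValid is an illustrative name, and the construction is repeated inline so the snippet runs on its own:

```matlab
n = 5;
v = circshift((1:n)', 1);          % same construction as in int_col

% No element equals its own index, and every integer 1..n appears once
isValid = all(v ~= (1:n)') && isequal(sort(v), (1:n)');
```

For any n > 1 the shifted vector passes this check, since element i holds i-1 (or n for the first element).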

How to perform XOR in a recursive scenario

I have a 1x5 char matrix. I need to perform a bitwise XOR operation on all the elements in the matrix. If T is the char matrix, I need a matrix T' such that
T'(i) = T(i) XOR T'(i-1) for i > 1
T'(1) = T(1)
Let the char matrix be T
T=['0000000000110111' '0000000001000001' '0000000001001010' '0000000010111000' '0000000000101111']
T'=['0000000000110111' '0000000001110110' '0000000000111100' '0000000010000100' '0000000010101011']
That is, leaving the first element as it is, I need to XOR all the other elements with the newly formed matrix. I tried the following code but I'm unable to get the correct result.
Yxor1d = [T(1) cellfun(@(a,b) char((a ~= b) + '0'), T(2:end), T'(1:end-1), 'UniformOutput', false)]
I need to perform the XOR operation such that , for obtaining the elements of T'
T' (2)= T(2) XOR T' (1)
T' (3)= T(3) XOR T' (2)
It'll be really helpful to know where I went wrong. Thanks.
cellfun expects cell arrays as input, but T is a character array: writing those 5 strings inside [] concatenates them into one long character array.
You probably don't want that. To fix this, make T a cell array by enclosing the strings in curly braces ({}) instead of square brackets ([]):
T={'0000000000110111' '0000000001000001' '0000000001001010' '0000000010111000' '0000000000101111'};
Because you have edited your post after I provided my answer, my previous answer using cellfun is now incorrect. Because you are using a recurrence relation where you are referring to the previous output rather than input, you can no longer use cellfun. You'll need to use a for loop. There are probably more elegant ways to do it, but this is the easiest if you want to get something working.
As such, initialize an output cell array that is the same size as the input cell array like above, then you'll need to initialize the first cell to be the first cell of the input, then iterate through each pair of input and output elements yourself.
So do something like this:
Yxor1d = cell(1,numel(T));
Yxor1d{1} = T{1};
for idx = 2 : numel(T)
Yxor1d{idx} = char(double(T{idx} ~= Yxor1d{idx-1}) + '0');
end
For each index i, we XOR the current input T{i} with the previous output T'{i-1}.
Using the above with your input cell array T, we get:
Yxor1d =
Columns 1 through 3
'0000000000110111' '0000000001110110' '0000000000111100'
Columns 4 through 5
'0000000010000100' '0000000010101011'
This matches with your specifications in your modified post.
Edit: There is a solution without a loop (note that de2bi and bi2de come with the Communications Toolbox):
T=['0000000000110111';'0000000001000001';'0000000001001010';'0000000010111000' ;'0000000000101111'];
Yxor = dec2bin(bi2de(mod(cumsum(de2bi(bin2dec(T))),2)),16)
Yxor =
0000000000110111
0000000001110110
0000000000111100
0000000010000100
0000000010101011
This uses the fact that you effectively want a cumulative XOR of the elements of your array.
The XOR of N booleans is true exactly when an odd number of them are true. So if you take the cumulative sum of each bit column, the running XOR is 1 wherever that sum is odd.
The one-liner above can be decomposed like this:
Y = bin2dec(T) ; %// convert char array T into decimal numbers
Y = de2bi( Y ) ; %// convert decimal array Y into array of "bits"
Y = cumsum(Y) ; %// do the cumulative sum on each bit column
Y = mod(Y,2) ; %// convert all "even" numbers to '0', and 'odd' numbers to '1'
Y = bi2de(Y) ; %// re-assemble the bits into decimal numbers
Yxor = dec2bin(Y,16) ; %// get their string representation
Note that if you are happy to handle arrays of bits (boolean) instead of character arrays, you can shave off a few lines from above ;-)
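As a sketch of that shorter route, the characters '0' and '1' can be turned into a numeric bit matrix by subtracting '0', which avoids the decimal round trip altogether. The variable name bits is illustrative:

```matlab
T = ['0000000000110111'; '0000000001000001'; '0000000001001010'; ...
     '0000000010111000'; '0000000000101111'];

bits = T - '0';                              % 5-by-16 numeric matrix of 0/1
Yxor = char(mod(cumsum(bits, 1), 2) + '0');  % cumulative XOR down each column
```

Yxor is again a 5-by-16 char array, with the same rows as the result shown above.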
Initial answer (simpler to grasp, but with a loop):
You can use the bitxor function, but you have to convert your char array into numeric values first:
T=['0000000000110111';'0000000001000001';'0000000001001010' ;'0000000010111000' ;'0000000000101111'];
Tbin = bin2dec(T) ; %// convert to numeric values
Ybin = Tbin ; %// pre-assign result, then loop ...
for idx = 2 : numel(Tbin)
Ybin(idx) = bitxor( Ybin(idx) , Ybin(idx-1) ) ;
end
Ychar = dec2bin(Ybin,16) %// convert back to 16bit char array representation if necessary
Ychar =
0000000000110111
0000000001110110
0000000000111100
0000000010000100
0000000010101011
edited answer after you redefined your problem

Importing text file into matrix form with indexes as strings?

I'm new to Matlab so bear with me. I have a text file in this form :
b0002 b0003 999
b0002 b0004 999
b0002 b0261 800
I need to read this file and convert it into a matrix. The first and second column in the text file are analogous to row and column of a matrix(the indices). I have another text file with a list of all values of 'indices'. So it should be possible to create an empty matrix beforehand.
b0002
b0003
b0004
b0005
b0006
b0007
b0008
Is there anyway to access matrix elements using custom string indices(I doubt it but just wondering)? If not, I'm guessing the only way to do this is to assign the first row and first column the index string values and then assign the third column values based on the first text file. Can anyone help me with that?
You can easily convert those strings to numbers and then use those as indices. For a given string, b0002:
s = 'b0002'
str2num(s(2:end)) % ans = 2
Furthermore, you can also do this with a char matrix:
t = ['b0002';'b0003';'b0004']
t =
b0002
b0003
b0004
str2num(t(:,2:end))
ans =
2
3
4
First, we use textscan to read the data in as two strings and a float (you could use other numerical formats). We have to open the file for reading first, and close it once we're done.
fid = fopen('myfile.txt');
A = textscan(fid,'%s%s%f');
fclose(fid);
textscan returns a cell array, so we have to extract your three variables. x and y are converted to single char arrays using cell2mat (works only if all the strings inside are the same length), n is a list of numbers.
x = cell2mat(A{1});
y = cell2mat(A{2});
n = A{3};
We can now convert x and y to numbers by taking every row (:) but only the second to final characters of each row (2:end), e.g. 0002, 0003, not b0002, b0003.
x = str2num(x(:,2:end));
y = str2num(y(:,2:end));
Slight problem with indexing - if I have a matrix A and I do this:
A = magic(8);
A([1,5],[3,8])
Then it returns four elements - [1,3],[5,3],[1,8],[5,8] - not two. But what you want is the location in your matrix equivalent to x(1),y(1) to be set to n(1) and so on. To do this, we need to 1) work out the final size of matrix. 2) use sub2ind to calculate the right locations.
% find the size
% if you have a specified size you want the output to be use that instead
xsize = max(x);
ysize = max(y);
% initialise the output matrix - not always necessary but good practice
out = zeros(xsize,ysize);
% turn our x,y into linear indices
ind = sub2ind([xsize,ysize],x,y);
% put our numbers in our output matrix
out(ind) = n;
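If the matrix should be indexed by position in the separate list of index names rather than by the number embedded in each name, one option is to look the labels up with ismember. This is a sketch with hypothetical in-memory data standing in for the two files; labels, rowLbl and colLbl are illustrative names:

```matlab
% Hypothetical contents of the two files, already read into cell arrays
labels = {'b0002'; 'b0003'; 'b0004'; 'b0261'};   % list of all index names
rowLbl = {'b0002'; 'b0002'; 'b0002'};            % first column of the data file
colLbl = {'b0003'; 'b0004'; 'b0261'};            % second column
n      = [999; 999; 800];                        % third column

% Position of each label in the master list (0 if a label is missing)
[~, r] = ismember(rowLbl, labels);
[~, c] = ismember(colLbl, labels);

out = zeros(numel(labels));
out(sub2ind(size(out), r, c)) = n;

% Look up an element by its labels
val = out(strcmp(labels, 'b0002'), strcmp(labels, 'b0261'));
```

This keeps the matrix as small as the list of names, instead of sizing it by the largest number that appears in a label.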

How to change row number in a FOR loop... (MATLAB newbie)

I have a set of data that is <106x25 double> but this is inside a struct and I want to extract the data into a matrix. I figured a simple FOR loop would accomplish this but I have hit a road block quite quickly in my MATLAB knowledge.
This is the only piece of code I have, but I just don't know enough about MATLAB to get this simple bit of code working:
>> x = zeros(106,25); for i = 1:106, x(i,:) = [s(i).surveydata]; end
??? Subscripted assignment dimension mismatch.
's' is a very very large file (in excess of 800MB), it is a <1 x 106 struct>. Suffice it to say, I just need to access a small portion of this which is s.surveydata where most rows are a <1 x 25 double> (a row vector IIRC) and some of them are empty and solely return a [].
s.surveydata obviously returns the results for all of the surveydata contained where s(106).surveydata would return the result for the last row. I therefore need to grab s(1:106).surveydata and put it into a matrix x. Is creating the matrix first by using x = zeros(106,25) incorrect in this situation?
Cheers and thanks for your time!
Ryan
The easiest, cleanest, and fastest way to write all the survey data into an array is to directly catenate it, using CAT:
x = cat(1,s.surveydata);
EDIT: note that if any surveydata is empty, x will have fewer rows than s has elements. If you need x to have the same amount of rows as s has elements, you can do the following:
%# find which entries in s have data
%# note that for the x above, hasData(k) contains the
%# element number in s that the k-th row of x came from
hasData = find(arrayfun(@(x)~isempty(x.surveydata),s));
%# initialize x to NaN, so as to not confuse the
%# real data with missing data entries. The call
%# to hasData when indexing makes this robust to an
%# empty first entry in s
x = NaN(length(s),length(s(hasData(1)).surveydata));
%# fill in only the rows of x that contain data
x(hasData,:) = cat(1,s(hasData).surveydata);
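The behaviour with empty entries can be tried on a small made-up struct array. The three-element s below is purely illustrative and stands in for the real 1x106 struct:

```matlab
% Toy struct array: the second entry has no survey data
s = struct('surveydata', {[1 2 3], [], [4 5 6]});

x = cat(1, s.surveydata);            % empty entries are dropped: 2-by-3

hasData = find(arrayfun(@(e) ~isempty(e.surveydata), s));
xFull = NaN(numel(s), size(x, 2));   % one row per element of s
xFull(hasData, :) = x;               % row 2 stays NaN
```

Here x has two rows while xFull has three, with the missing entry marked by a row of NaN.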
No, creating an array of zeros is not incorrect. In fact it's a good idea. You don't have to declare variables in Matlab before using them, but pre-allocating arrays that are filled in a loop has speed benefits.
x = zeros(numel(s), 25); % one row per element of s, 25 values per row
for i = 1:numel(s)
    if ~isempty(s(i).surveydata) % rows without data stay all-zero
        x(i, :) = s(i).surveydata;
    end
end
Should accomplish what you want.
EDIT: Since OP indicated that some rows are empty, I accounted for that like he said.
How about this? It depends on what s contains.
If s(i).surveydata is a scalar:
x = zeros(106,25);
for i = 1:106
x(i,1) = [s(i).surveydata];
end
I am guessing the following is what you want, though it is not entirely clear.
If s(i).surveydata is a row vector:
x = zeros(106,25);
for i = 1:106
x(i,:) = [s(i).surveydata];
end
If s(i).surveydata is a column vector:
x = zeros(106,25);
for i = 1:106
x(i,:) = [s(i).surveydata]';
end

MATLAB converting a vector of values to uint32

I have a vector containing the values 0, 1, 2 and 3. What I want to do is take the lower two bits from each set of 16 elements drawn from this vector and append them all together to get one uint32. Anyone know an easy way to do this?
Follow-up: What if the number of elements in the vector isn't an integer multiple of 16?
Here's a vectorized version:
v = floor(rand(64,1)*4);
nWord = size(v,1)/16;
sum(reshape([bitget(v,2) bitget(v,1)]',[32 nWord]).*repmat(2.^(31:(-1):0)',[1 nWord ]))
To refine what was suggested by Jacob in his answer and mtrw in his comment, here's the most succinct version I can come up with (given a 1-by-N variable vec containing the values 0 through 3):
value = uint32(vec(1:16)*4.^(0:15)');
This treats the first element in the array as the least-significant bit in the result. To treat the first element as the most-significant bit, use the following:
value = uint32(vec(16:-1:1)*4.^(0:15)');
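A small worked case makes the difference between the two orderings visible. The sample vector below is made up for illustration, with only its first element non-zero:

```matlab
vec = [3 zeros(1, 15)];                          % first element is 3, the rest 0

lsbFirst = uint32(vec(1:16)    * 4.^(0:15)');    % 3 lands in the lowest two bits
msbFirst = uint32(vec(16:-1:1) * 4.^(0:15)');    % 3 lands in the highest two bits
```

With this input lsbFirst is 3 and msbFirst is 3*4^15 = 3221225472. Both fit in a uint32, and the largest possible sum, 4^16 - 1, is still represented exactly in double precision before the cast.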
EDIT: This addresses the new revision of the question...
If the number of elements in your vector isn't a multiple of 16, then the last series of numbers you extract from it will have less than 16 values. You will likely want to pad the higher bits of the series with zeroes to make it a 16-element vector. Depending on whether the first element in the series is the least-significant bit (LSB) or most-significant bit (MSB), you will end up padding the series differently:
v = [2 3 1 1 3 1 2 2]; % A sample 8-element vector
v = [v zeros(1,8)]; % If v(1) is the LSB, set the higher bits to zero
% or...
v = [zeros(1,8) v]; % If v(1) is the MSB, again set the higher bits to zero
If you want to process the entire vector all at once, here is how you would do it (with any necessary zero-padding included) for the case when vec(1) is the LSB:
nValues = numel(vec);
nPad = mod(-nValues,16); % Number of zeroes needed to reach a multiple of 16
vec = [vec(:).' zeros(1,nPad)]; % Pad with zeroes
vec = reshape(vec,16,[])'; % Reshape to an N-by-16 matrix
values = uint32(vec*4.^(0:15)');
and when vec(1) is the MSB:
nValues = numel(vec);
nRem = rem(nValues,16); % Number of values in the last, partial series
nPad = mod(-nValues,16); % Number of zeroes needed to complete it
vec = [vec(1:(nValues-nRem)) zeros(1,nPad) ...
       vec((nValues-nRem+1):nValues)]; % Pad with zeroes
vec = reshape(vec,16,[])'; % Reshape to an N-by-16 matrix
values = uint32(fliplr(vec)*4.^(0:15)');
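As a check of the padding logic for the LSB-first case, here is a sketch with a made-up 17-element vector. The number of zeroes to append is whatever is needed to reach the next multiple of 16, and nPad, padded and words are illustrative names:

```matlab
vec  = [ones(1, 16) 2];              % 17 elements: one full word plus one value
nPad = mod(-numel(vec), 16);         % 15 zeroes complete the second word

padded = [vec zeros(1, nPad)];       % vec(1) is the LSB of the first word
words  = reshape(padded, 16, [])';   % 2-by-16, one row per output word
values = uint32(words * 4.^(0:15)');
```

This gives values = [1431655765; 2]: the first word is sixteen repetitions of the bit pair 01, and the second word holds the leftover 2 in its lowest two bits.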
I think you should have a look at bitget, bitand and bitshift. Something like this should work (I haven't worked with Matlab for a long time, so treat it as a sketch):
result = uint32(0);
for i = 1:16
    % make room for two more bits, then append the lower two bits of vector(i)
    result = bitor(bitshift(result, 2), uint32(bitand(vector(i), 3)));
end
Note that this puts the bits of the first element in the highest bits of the result, so you might want to run i from 16 down to 1 instead.