Replacing several different characters of a string - MATLAB

I have to write a function that replaces the characters of a string according to the following mapping:
A=U
T=A
G=C
C=G
Example:
Input: 'ATAGTACCGGTTA'
Therefore, the output should be:
'UAUCAUGGCCAAU'
I can replace a single character, but I have no idea how to replace several. It would be easy if the "G=C and C=G" condition were not there.
I use:
in='ATAGTACCGGTTA'
check=in=='A'
in(check)='U'
ans='UTUGTUCCGGTTU'
If I keep doing this, at some point G will be replaced by C, and then all the Cs will be replaced by G. How can I stop this? Any help will be appreciated.

Just for fun, here's probably the absolute simplest way, via indexing:
key = 'UGCA';
[~, ~, idx] = unique(in);
out = key(idx'); % transpose idx since unique() returns a column vector
I do love indexing :D
Edit: As rightly pointed out, this is very optimised for the question as stated. Since [a, ~, idx] = unique(in); returns a and idx such that a(idx) == in, and by default a is sorted, we can just assume that a == 'ACGT' and pre-construct key to be the appropriate translation of indices into a.
If some characters from the known alphabet never appear in the input string, or if other unknown characters appear, then the indices don't match and the assumption breaks. In that case, we have to calculate the appropriate key explicitly - filling in the step that was optimised out above:
alph = 'ACGT';
trans = 'UGCA';
[key, ~, idx] = unique(in);
[~, alphidx, keyidx] = intersect(alph, key); % find which elements of alph
% appear at which points in key
key(keyidx) = trans(alphidx); % translate the elements of key that we can
out = key(idx');
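To see why the explicit version matters, here is a small sketch with a hypothetical input that contains only A and T. The input string and the variable name naive are made up for illustration; everything else follows the two snippets above.

```matlab
alph  = 'ACGT';
trans = 'UGCA';
in    = 'ATTA';                 % hypothetical input: C and G never appear

% Shortcut version: silently assumes unique(in) is exactly 'ACGT'
[~, ~, idx] = unique(in);       % here unique(in) is only 'AT'
naive = trans(idx');            % picks the wrong slots of trans: 'UGGU'

% Explicit version: build the key from the letters actually present
[key, ~, idx] = unique(in);
[~, alphidx, keyidx] = intersect(alph, key);
key(keyidx) = trans(alphidx);   % key becomes 'UA'
out = key(idx');                % 'UAAU'
```

The shortcut maps T to G here because T is only the second distinct letter in this input, while the explicit version looks up each letter's position in alph first.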

The simplest way would be to use an intermediary letter. For instance:
in='ATAGTACCGGTTA'
in(in == 'A')='U'
in(in == 'T')='A'
in(in == 'C')='X'
in(in == 'G')='C'
in(in == 'X')='G'
This way you keep the 'C' and 'G' characters separate.
EDIT:
As others have mentioned, there are a few other things you could do to improve this approach (though personally I think Notlikethat's way is cleanest). For instance, if you use a second variable, you don't have to worry about keeping 'C' and 'G' separate:
in='ATAGTACCGGTTA'
out=in;
out(in == 'A')='U';
out(in == 'T')='A';
out(in == 'C')='G';
out(in == 'G')='C';
Alternatively, you could make your indices first, then index after:
in='ATAGTACCGGTTA'
inA=in=='A';
inT=in=='T';
inC=in=='C';
inG=in=='G';
in(inA)='U';
in(inT)='A';
in(inC)='G';
in(inG)='C';
Finally, my personal favourite for sheer idiocy:
out=char(in+floor((68-in).*(in<70)*7/4)*4-round(ceil((in-67)/4)*3.7));
(Seriously, that last one works)
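Since the question asks for a function, the second-variable idea could be wrapped up roughly like this. This is only a sketch: the function name transcribe and the variables from and to are hypothetical, and the mapping is hard-coded to the four letters in the question.

```matlab
function out = transcribe(in)
% TRANSCRIBE  Hypothetical helper: map A->U, T->A, G->C, C->G.
from = 'ATGC';
to   = 'UACG';
out  = in;                        % copy, so every test runs on the original
for k = 1:numel(from)
    out(in == from(k)) = to(k);   % compare against in, write into out
end
end
```

Because every comparison is made against the untouched input, the order of the replacements no longer matters, and transcribe('ATAGTACCGGTTA') should give 'UAUCAUGGCCAAU'. Characters outside the table are passed through unchanged.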

You can perform multiple character translation with bsxfun.
Inputs:
in = 'ATAGTACCGGTTA';
pat = ['A','T','G','C'];
subst = ['U','A','C','G'];
out0 ='UAUCAUGGCCAAU';
Translate all characters simultaneously:
>> ii = (1:numel(pat))*bsxfun(@eq,in,pat.'); %' instead of repmat and .*
>> out = subst(ii)
out =
UAUCAUGGCCAAU
>> isequal(out,out0)
ans =
1
Say you only want to translate a subset of the characters, leaving part of the sequence intact. That is easily solved with logical indexing and a few extra lines:
% Leave the Gs and Cs in place
pat = ['A','T'];
subst = ['U','A'];
ii = (1:numel(pat))*bsxfun(@eq,in,pat.'); %' same
out = char(zeros(1,numel(in)));
nz = ii>0;
out(nz) = subst(ii(nz));
out(~nz) = in(~nz)
out =
UAUGAUCCGGAAU
The original Gs and Cs are unchanged; A became U, and T became A (T is gone).
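On R2016b and later, the same comparison can be written without bsxfun, because == expands a column against a row on its own (implicit expansion). A minimal sketch of the full translation under that assumption; the variable name hits is made up here:

```matlab
in    = 'ATAGTACCGGTTA';
pat   = 'ATGC';
subst = 'UACG';

% Requires R2016b or later: 4x1 column compared against 1xN row
hits = (pat.' == in);                   % 4xN logical, one row per pattern letter
ii   = (1:numel(pat)) * double(hits);   % index of the matching pattern letter
out  = subst(ii);                       % 'UAUCAUGGCCAAU'
```

As with the bsxfun version, this form assumes every character of in appears in pat; otherwise ii contains zeros and the partial-translation variant above is needed.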

I would suggest using containers.Map:
m=containers.Map({'A','T','G','C'},{'U','A','C','G'})
mapfkt=@(input)(cell2mat(m.values(num2cell(input))))
Usage:
mapfkt('ATAGTACCGGTTA')

Here is another method that should be fairly efficient, general, and in the line of thought of your original attempt:
%Suppose this is your input
myString = 'abcdeabcde';
fromString = 'ace';
toString = 'xyz';
%Then it just takes this:
[idx, fromLocation] = ismember(myString,fromString)
myString(idx)=toString(fromLocation(idx))
If you know that all letters need to be replaced, the last line can be slightly simplified, as you won't need to use idx.
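Applied to the sequence problem from the question, the same two lines might look like this. The input here is hypothetical and deliberately contains an N that is not in the table, to show that unmatched characters are left alone:

```matlab
in   = 'ATNGC';                     % hypothetical input; N is not in the table
from = 'ATGC';
to   = 'UACG';

[idx, loc] = ismember(in, from);    % loc is 0 wherever there is no match
out = in;                           % start from a copy of the input
out(idx) = to(loc(idx));            % translate only the matched positions
% out is 'UANCG'
```

Since ismember looks at the original string once, the G/C swap causes no trouble.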

Related

find string in non-scalar structure matlab

Here's a non-scalar structure in matlab:
clearvars s
s=struct;
for id=1:3
s(id).wa='nko';
s(id).test='5';
s(id).ad(1,1).treasurehunt='asdf'
s(id).ad(1,2).treasurehunt='as df'
s(id).ad(1,3).treasurehunt='foobar'
s(id).ad(2,1).treasurehunt='trea'
s(id).ad(2,2).treasurehunt='foo bar'
s(id).ad(2,3).treasurehunt='treasure'
s(id).ad(id,4).a=magic(5);
end
Is there an easy way to test whether the structure s contains the string 'treasure' without having to loop through every field (e.g. doing a 'grep' through the actual content of the variable)?
The aim is to see 'quick and dirtily' whether a string exists (regardless of where) in the structure. In other words (for Linux users): I'd like to use 'grep' on a MATLAB variable.
I tried arrayfun(@(x) any(strcmp(x, 'treasure')), s) with no success, output:
ans =
1×3 logical array
0 0 0
One general approach (applicable to any structure array s) is to convert your structure array to a cell array using struct2cell, test if the contents of any of the cells are equal to the string 'treasure', and recursively repeat the above for any cells that contain structures. This can be done in a while loop that stops if either the string is found or there are no structures left to recurse through. Here's the solution implemented as a function:
function found = string_hunt(s, str)
c = reshape(struct2cell(s), [], 1);
found = any(cellfun(@(v) isequal(v, str), c));
index = cellfun(@isstruct, c);
while ~found && any(index)
c = cellfun(@(v) {reshape(struct2cell(v), [], 1)}, c(index));
c = vertcat(c{:});
found = any(cellfun(@(c) isequal(c, str), c));
index = cellfun(@isstruct, c);
end
end
And using your sample structure s:
>> string_hunt(s, 'treasure')
ans =
logical
1 % True!
This is one way to avoid an explicit loop:
% Collect all the treasurehunt entries into a cell with strings
s_cell={s(1).ad.treasurehunt, s(2).ad.treasurehunt, s(3).ad.treasurehunt};
% Check if any 'treasure' entries exist
find_treasure=nonzeros(strcmp('treasure', s_cell));
% Empty if none
if isempty(find_treasure)
disp('Nothing found')
else
disp(['Found treasure ',num2str(length(find_treasure)), ' times'])
end
Note that you can also just do
% Collect all the treasurehunt entries into a cell with strings
s_cell={s(1).ad.treasurehunt, s(2).ad.treasurehunt, s(3).ad.treasurehunt};
% Check if any 'treasure' entries exist
find_treasure=~isempty(nonzeros(strcmp('treasure', s_cell)));
...if you're not interested in the number of occurrences.
Depending on the format of your real data, and if you can find strings that contain your string:
any( ~cellfun('isempty',strfind( arrayfun( @(x)[x.ad.treasurehunt],s,'uni',0 ) ,str)) )

finding the number of occurrence of a pattern within a cell in matlab?

I have a cell array like this:
x = {'3D'
'B4'
'EF'
'D8'
'E7'
'6C'
'33'
'37'}
Let's assume that the cell array is 1000x1. I want to count the occurrences of pattern = [30;30;64;63] within this cell array, in the order shown. In other words, it first checks x{1,1},x{2,1},x{3,1},x{4,1}, then x{2,1},x{3,1},x{4,1},x{5,1}, and so on until the end of the cell array, and returns the number of occurrences.
Here is my code, but it doesn't work:
while (size (pattern)< size(x))
count = 0;
for i=1:size(x)-length(pattern)+1
if size(abs(x(i:i+size(pattern)-1)-x))==0
count = count+1;
end
end
end
Your example code has a couple of issues - foremost I don't believe you are doing any comparison operations, which would be necessary to identify the occurrence of the pattern within the search data (x). Also, there is a variable type mismatch between x and pattern - one is a cell array of strings, and the other is a decimal array.
One way to approach this problem would be to restructure x and pattern as strings, and then use strfind to find occurrences of pattern. This method will only work if there is no missing data in either of the variables.
x = {'3D';'B4';'EF';'D8';'E7';'6C';'33';'37';'xE';'FD';'8y'};
pattern = {'EF','D8'};
collated_x=[x{:}];
collated_pattern = [pattern{:}];
found_locations = strfind(collated_x, collated_pattern);
% Remove 'offset' matches that start at even locations
found_locations = found_locations(mod(found_locations,2)==1);
count = length(found_locations)
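If you would rather stay close to the sliding-window loop from the question, a sketch along these lines compares each window of cells against the pattern directly, so no offset filtering is needed. It assumes the pattern is stored as a column cell array of strings, like x:

```matlab
x = {'3D';'B4';'EF';'D8';'E7';'6C';'33';'37';'xE';'FD';'8y'};
pattern = {'EF';'D8'};              % column cell array, same layout as x

n = numel(pattern);
count = 0;
for i = 1:numel(x) - n + 1
    % isequal on two cell arrays compares their contents element by element
    if isequal(x(i:i+n-1), pattern)
        count = count + 1;
    end
end
% count is 1 for this data
```

This is slower than strfind on long inputs, but it also works when the cell entries have different lengths.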
Use the string find function (strfind).
This is a fast and simple solution:
clear
str_pattern=['B4','EF']; %same as str_pattern=['B4EF'];
x = {'3D'
'B4'
'EF'
'D8'
'EB'
'4E'
'F3'
'B4'
'EF'
'37'} ;
str_x=horzcat(x{:});
inds0=strfind(str_x,str_pattern); %including in-middle
inds1=inds0(bitand(inds0,1)==1); %exclude all in-middle results
disp(str_x);
disp(str_pattern);
disp(inds0);
disp(inds1);

Counting occurrences of a character in a string within a cell

I'm having trouble figuring out how to count the occurrences of a character in a string within a cell. For example, I have a file that contains information like so:
type
m
mmNs
SmNm
and I'm trying to determine how many m's are in each line. To do this, I've tried this code:
sampleddata = dataset('file','sample.txt','Delimiter','\t');
muts = sampleddata.type;
fileID = fopen('number_occur.txt','w');
for j = 1:3
mutations = muts(j)
M = length(find(mutations == 'm'));
fprintf(fileID, '%1f\n',M)
end
fclose(fileID)
However, I get an error that informs me: "Undefined operator '==' for input arguments of type 'cell'." Does anyone know how to overcome this problem?
Gonna post a result here in case you did not find a way to do it. There are loads of ways to do it, I am just going to put one of them.
Basically, you want a regex to do string matches:
a = {'type';
'm';
'mmNs';
'SmNm';
'mmmmM'} %//Load in Data,
pattern = 'm'; %//The pattern you are looking for is 'm'; it could be anything really: a number, a specific word, or a specific pattern
lines = regexp(a, pattern, 'tokens'); %// look for this pattern in each line
result = cellfun('length',lines); %//count the size of matched patterns, so each time it matches, the size should increase by 1.
This gives the result in a matrix form:
result =
0
1
2
2
4
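As for the original error: muts(j) with parentheses returns a 1x1 cell, and == is not defined for cells, whereas muts{j} with braces returns the string inside it. For a single character, a plain comparison inside cellfun is probably enough. A minimal sketch, reusing the cell array a from above (the name counts is made up here):

```matlab
a = {'type'; 'm'; 'mmNs'; 'SmNm'; 'mmmmM'};

% For each string, compare every character with 'm' and count the hits
counts = cellfun(@(s) sum(s == 'm'), a);
% counts is [0; 1; 2; 2; 4]

% The same idea for one entry, using braces to get at the string itself
M = sum(a{3} == 'm');               % 2
```

The comparison is case-sensitive, so the capital M in the last entry is not counted.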

Convert matlab symbol to array of products

Can I convert a symbol that is a product of products into an array of products?
I tried to do something like this:
syms A B C D;
D = A*B*C;
factor(D);
but it doesn't factor it out (mostly because that isn't what factor is designed to do).
ans =
A*B*C
I need it to work if A B or C is replaced with any arbitrarily complicated parenthesized function, and it would be nice to do it without knowing what variables are in the function.
For example (all variables are symbolic):
D = x*(x-1)*(cos(z) + n);
factoring_function(D);
should be:
[x, x-1, (cos(z) + n)]
It seems like a string parsing problem, but I'm not confident that I can convert back to symbolic variables afterwards (also, string parsing in matlab sounds really tedious).
Thank you!
Use regexp on the string to split based on *:
>> str = 'x*(x-1)*(cos(z) + n)';
>> factors_str = regexp(str, '\*', 'split')
factors_str =
'x' '(x-1)' '(cos(z) + n)'
The result factors_str is a cell array of strings. To convert to a cell array of sym objects, use
N = numel(factors_str);
factors = cell(1,N); %// each cell will hold a sym factor
for n = 1:N
factors{n} = sym(factors_str{n});
end
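One caveat: splitting on every * also cuts inside parentheses, so a factor such as (2*y-1) would be broken apart. A possible workaround is to split only on * characters that sit outside all parentheses. The sketch below uses a hypothetical input string and made-up variable names (depth, cuts, starts, stops), and assumes the parentheses are balanced:

```matlab
str = 'x*(2*y-1)*(cos(z) + n)';     % hypothetical input with a nested '*'

% Nesting depth at each character: +1 at '(' and -1 at ')'
depth = cumsum((str == '(') - (str == ')'));

% Keep only the '*' characters that are outside all parentheses
cuts   = find(str == '*' & depth == 0);
starts = [1, cuts + 1];
stops  = [cuts - 1, numel(str)];

factors_str = arrayfun(@(a, b) strtrim(str(a:b)), starts, stops, ...
                       'UniformOutput', false);
% factors_str is {'x', '(2*y-1)', '(cos(z) + n)'}
```

The resulting cell array can be fed to the conversion loop above in the same way.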
I ended up writing the code to do this in python using sympy. I think I'm going to port the matlab code over to python because it is a more preferred language for me. I'm not claiming this is fast, but it serves my purposes.
# Factors a sum of products function that is first order with respect to all symbolic variables
# into a reduced form using products of sums whenever possible.
# @params orig_exp A symbolic expression to be simplified
# @params depth Used to control indenting for printing
# @params verbose Whether to print or not
def factored(orig_exp, depth = 0, verbose = False):
    # Prevents sympy from doing any additional factoring
    exp = expand(orig_exp)
    if verbose: tabs = '\t'*depth
    terms = []
    # Break up the added terms
    while(exp != 0):
        my_atoms = symvar(exp)
        if verbose:
            print tabs,"The expression is",exp
            print tabs,my_atoms, len(my_atoms)
        # There is nothing to sort out, only one term left
        if len(my_atoms) <= 1:
            terms.append((exp, 1))
            break
        (c,v) = collect_terms(exp, my_atoms[0])
        # Makes sure it doesn't factor anything extra out
        exp = expand(c[1])
        if verbose:
            print tabs, "Collecting", my_atoms[0], "terms."
            print tabs,'Seperated terms with ',v[0], ', (',c[0],')'
        # Factor the leftovers and recombine
        c[0] = factored(c[0], depth + 1)
        terms.append((v[0], c[0]))
    # Combines trivial terms whenever possible
    i=0
    def termParser(thing): return str(thing[1])
    terms = sorted(terms, key = termParser)
    while i<len(terms)-1:
        if equals(terms[i][1], terms[i+1][1]):
            terms[i] = (terms[i][0]+terms[i+1][0], terms[i][1])
            del terms[i+1]
        else:
            i += 1
    recombine = sum([terms[i][0]*terms[i][1] for i in range(len(terms))])
    return simplify(recombine, ratio = 1)

How do I concatenate the empty String to the beginning of a String in Matlab

Say I typed x = 'BODD' into the command prompt of MATLAB and then said x(1) it would return B. What I want is x(1) to return the empty String ('') or nothing etc. and x(2) to return B and so forth up until x(5) returning the final D?
I assume that you mean that you really do want the empty zero-length string, ''. There have been some Answers to this question that assume that you meant that you wanted the one-character string that contains a space, ASCII value 32.
If that's the case, I'm afraid you can't do that - MATLAB arrays (including character arrays, which is all that a MATLAB "string" is) don't work that way. There are two ways to look at it...
You asked for x(1). Now, the indexing expression that you used, 1, has size 1x1. Therefore, you are guaranteed to get either a 1x1 return value, OR an error. That means that there's no way to get a 0x1 or 0x0 (the true "empty string"). This is similar to the way that, if you had asked for x(2:4), you would be guaranteed to get a 1x3 array of characters back. In that case, 2:4 is a 1x3 array.
There's no way to "meaningfully" prepend a zero-length string to the beginning of another string. If a = 'WXYZ';, then running b = ['' a] just returns 'WXYZ' back. It didn't somehow stick a magical placeholder for an empty string at the front of the original string.
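Both points can be checked at the prompt. The last line of this sketch also shows one possible workaround that is not part of the answer above: storing the characters in a cell array, where an element is allowed to be empty. The variable names are made up for illustration.

```matlab
a = 'WXYZ';
b = ['' a];
same    = isequal(a, b);            % true: prepending '' changed nothing
szEmpty = size('');                 % [0 0], the true empty string
szOne   = size(a(1));               % [1 1], a scalar index gives a scalar

% Workaround sketch: a cell array can hold an empty string as an element
c = [{''}, num2cell('BODD')];       % c{1} is '', c{2} is 'B', ..., c{5} is 'D'
```

Note the braces: c{1} returns the empty string, while c(1) would return a 1x1 cell, in line with the size rule described above.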
You can't concatenate '' at the end or the beginning.
However, you can prepend a blank (space), like this:
>> x='BODD';
>> x=[' ' x]; % Use normal matrix concatenation
>> x(1)
ans =
>> x(2)
ans =
B
Try the following concatenation:
x = [' ' x];
If you want the string itself to still be 'BODD', you could try writing a custom function:
function [char] = emptyConcat(string, index)
if (index == 1)
char = '';
else
char = string(index - 1);