Mapping letters to integers in MATLAB - matlab

The function arithenco needs the input message to be a sequence of positive integers. Hence, I need convert a message into a sequence of numbers message_int, by using the following mapping.
‘A’→1, ‘C’→2, ‘G’→3, ‘T’→4.

From what I understand, the alphabet you are using contains only four values A,C,G,T (DNA sequences I suppose).
Simple comparison would suffice:
seq = 'TGGAGGCCCACAACCATTCCCTCAGCCCAATTGACCGAAAGGGCGCGA';
msg_int = zeros(size(seq));
msg_int(seq=='A') = 1;
msg_int(seq=='C') = 2;
msg_int(seq=='G') = 3;
msg_int(seq=='T') = 4;

Oh, just reread your question: your mapping is not so simple. Sorry.
(since darvidsOn wrote the same I won't delete this answer - it might give you a start - but it doesn't answer your question completely).
Have a look at http://www.matrixlab-examples.com/ascii-chart.html
You can use d = double('A') to convert a char into a double- you will then need to subtract 64 to get the mapping that you want (because A is ascii code 65).

Related

Fastest type to use for comparing hashes in matlab

I have a table in Matlab with some columns representing 128 bit hashes.
I would like to match rows, to one or more rows, based on these hashes.
Currently, the hashes are represented as hexadecimal strings, and compared with strcmp(). Still, it takes many seconds to process the table.
What is the fastest way to compare two hashes in matlab?
I have tried turning them into categorical variables, but that is much slower. Matlab as far as I know does not have a 128 bit numerical type. nominal and ordinal types are deprecated.
Are there any others that could work?
The code below is analogous to what I am doing:
nodetype = { 'type1'; 'type2'; 'type1'; 'type2' };
hash = {'d285e87940fb9383ec5e983041f8d7a6'; 'd285e87940fb9383ec5e983041f8d7a6'; 'ec9add3cf0f67f443d5820708adc0485'; '5dbdfa232b5b61c8b1e8c698a64e1cc9' };
entries = table(categorical(nodetype),hash,'VariableNames',{'type','hash'});
%nodes to match. filter by type or some other way so rows don't match to
%themselves.
A = entries(entries.type=='type1',:);
B = entries(entries.type=='type2',:);
%pick a node/row with a hash to find all counterparts of
row_to_match_in_A = A(1,:);
matching_rows_in_B = B(strcmp(B.hash,row_to_match_in_A.hash),:);
% do stuff with matching rows...
disp(matching_rows_in_B);
The hash strings are faithful representations of what I am using, but they are not necessarily read or stored as strings in the original source. They are just converted for this purpose because its the fastest way to do the comparison.
Optimization is nice, if you need it. Try it out yourself and measure the performance gain for relevant test cases.
Some suggestions:
Sorted arrays are easier/faster to search
Matlab's default numbers are double, but you can also construct integers. Why not use 2 uint64's instead of the 128bit column? First search for the upper 64bit, then for the lower; or even better: use ismember with the row option and put your hashes in rows:
A = uint64([0 0;
0 1;
1 0;
1 1;
2 0;
2 1]);
srch = uint64([1 1;
0 1]);
[ismatch, loc] = ismember(srch, A, 'rows')
> loc =
4
2
Look into the compare functions you use (eg edit ismember) and strip out unnecessary operations (eg sort) and safety checks that you know in advance won't pose a problem. Like this solution does. Or if you intend do call a search function multiple times, sort in advance and skip the check/sort in the search function later on.

Read specific portions of an excel file based on string values in MATLAB

I have an excel file and I need to read it based on string values in the 4th column. I have written the following but it does not work properly:
[num,txt,raw] = xlsread('Coordinates','Centerville');
zn={};
ctr=0;
for i = 3:size(raw,1)
tf = strcmp(char(raw{i,4}),char(raw{i-1,4}));
if tf == 0
ctr = ctr+1;
end
zn{ctr}=raw{i,4};
end
data=zeros(1,10); % 10 corresponds to the number of columns I want to read (herein, columns 'J' to 'S')
ctr=0;
for j = 1:length(zn)
for i=3:size(raw,1)
tf=strcmp(char(raw{i,4}),char(zn{j}));
if tf==1
ctr=ctr+1;
data(ctr,:,j)=num(i-2,10:19);
end
end
end
It gives me a "15129x10x22 double" thing and when I try to open it I get the message "Cannot display summaries of variables with more than 524288 elements". It might be obvious but what I am trying to get as the output is 'N = length(zn)' number of matrices which represent the data for different strings in the 4th column (so I probably need a struct; I just don't know how to make it work). Any ideas on how I could fix this? Thanks!
Did not test it, but this should help you get going:
EDIT: corrected wrong indexing into raw vector. Also, depending on the format you might want to restrict also the rows of the raw matrix. From your question, I assume something like selector = raw(3:end,4); and data = raw(3:end,10:19); should be correct.
[~,~,raw] = xlsread('Coordinates','Centerville');
selector = raw(:,4);
data = raw(:,10:19);
[selector,~,grpidx] = unique(selector);
nGrp = numel(selector);
out = cell(nGrp,1);
for i=1:nGrp
idx = grpidx==i;
out{i} = cell2mat(data(idx,:));
end
out is the output variable. The key here is the variable grpidx that is an output of the unique function and allows you to trace back the unique values to their position in the original vector. Note that unique as I used it may change the order of the string values. If that is an issue for you, use the setOrderparameter of the unique function and set it to 'stable'

Scientific notation in MATLAB

Say I have an array that contains the following elements:
1.0e+14 *
1.3325 1.6485 2.0402 1.0485 1.2027 2.0615 1.7432 1.9709 1.4807 0.9012
Now, is there a way to grab 1.0e+14 * (base and exponent) individually?
If I do arr(10), then this will return 9.0120e+13 instead of 0.9012e+14.
Assuming the question is to grab any elements in the array with coefficient less than one. Is there a way to obtain 1.0e+14, so that I could just do arr(i) < 1.0e+14?
I assume you want string output.
Let a denote the input numeric array. You can do it this way, if you don't mind using evalc (a variant of eval, which is considered bad practice):
s = evalc('disp(a)');
s = regexp(s, '[\de+-\.]+', 'match');
This produces a cell array with the desired strings.
Example:
>> a = [1.2e-5 3.4e-6]
a =
1.0e-04 *
0.1200 0.0340
>> s = evalc('disp(a)');
>> s = regexp(s, '[\de+-\.]+', 'match')
s =
'1.0e-04' '0.1200' '0.0340'
Here is the original answer from Alain.
Basic math can tell you that:
floor(log10(N))
The log base 10 of a number tells you approximately how many digits before the decimal are in that number.
For instance, 99987123459823754 is 9.998E+016
log10(99987123459823754) is 16.9999441, the floor of which is 16 - which can basically tell you "the exponent in scientific notation is 16, very close to being 17".
Now you have the exponent of the scientific notation. This should allow you to get to whatever your goal is ;-).
And depending on what you want to do with your exponent and the number, you could also define your own method. An example is described in this thread.

Vectorize filtering on matlab cell structures

I have a huge cell vector cc (size: 1xN) of the form:
cc{1} = {'indexString1', 'str_row1col1', 'str_row1col2' }
cc{2} = {'indexString2', 'str_row2col1', 'bighello', 'str_row1col3' }
cc{3} = {'indexString3','str_row3col1'}
cc{4} = {'indexString4','str_row3col1', 'helloWorld'}
I want to traverse each cell and remove specific cells that contain the word "hello", e.g c{4}{2}. Can we do that without for loops keeping the final structure of cc?
Best,
Thoth.
EDIT: From the answers and comments I have seen that the structure of the cell impose some limitations. So any other suggestion to store my data are welcome. I just want to keep together all the cells (e.g. 'str_row1col1', 'str_row1col2') that correspond to the same indexString*n* (e.g. indexString1). I made this edit in case it helps some final reshape.
Using regular expressions, you can obtain a logical array in which zeros represent occurences of the word 'hello' somewhere in the nested cell. As #LuisMendo pointed out, this would be much easier to delete the unwanted cells if they were not nested:
clc
clear
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
A = (cellfun(#isempty,regexp([cc{:}],'(\w*hello|hello\w*)','match')))
Gives the following array:
A =
1 1 1 0 1 1 1 0
For the rest I think you would need a loop since the nested cells are not all of the same size. Anyhow I hope it helps you a bit.
EDIT Here is what you can do using a for loop. In order to identify words of interest (earth and water as in your comment below), simply add them to the argument in the call to regexp. This character: | is used to make some sort of list so that Matlab checks all the expressions in the brackets.
Please refer to this page for more infos on regular expressions. There is also a possibility to look for regular expressions with case-sensitivity.
Sample code, in which I added strings containing earth and water:
cc{1} = {'str_row1col1', 'earth!superman' 'str_row1col2' 'DummyString'};
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1' 'str_row3col3' 'water_batman'};
cc{4} = {'str_row3col1' 'str_row4col2' 'helloWorld'};
cc{5} = {'str_row5_LegoMan' 'str_row5col2' 'AnotherDummyString' 'Useless String' 'BonjourWorld'};
% With a for loop, for example:
FinalCell = cell(size(cc,2),1);
for k = 1:size(cc,2)
DummyCell = cc{k}; % Use dummy cell for easier indexing
% This is where you tell Matlab what words/expressions you are looking for
A = cellfun(#isempty,regexp(cc{k},'(\w*hello|hello\w*|earth|water)','match'));
DummyCell(~A) = []; % Remove the cells containing the strings/words of interest
FinalCell{k} = DummyCell;
end
Then you're good to go. Hope that helps!
The closest thing possible I found is:
clear all
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
cc1 = [cc{:}];
cc1 = cc1(~strcmp('bighello',cc1));
This reorganize your array into a one dimensional array and it cannot match regular expression, but only whole words.
For a better job I am afraid you have to use for loops.

Switching even- and odd-indexed characters in Matlab

Using Matlab, write a function called tripFlip that takes in one string and switches each even-indexed charactar with the odd-indexed character immediately preceding it. Use iteration. Example: tripFlip('orange') ->'ronaeg'
I assume this is homework, so I won't give a complete answer. You can use double to convert a string to an array, and char to go back, if working with arrays makes the problem any easier. Otherwise, strings seem to work just like arrays in terms of indexing: s(1) gets the first character, length(s) gets the length, etc.
I agree its a homework question, and posting it here will only bite you back in the long run. But here goes:
a = 'orange';
b = '';
b(2:2:length(a))= a(1:2:end);
b(1:2:length(a))= a(2:2:end);
disp(b)
In one line:
>> input = 'orange';
>> output = input(reshape([2:2:end;1:2:end],1,[]))
output =
ronaeg
It's not a function and doesn't use iteration, but it's how you'd solve this if you were to learn Matlab.
Something like this should do the trick, perhaps you want to make it a bit more robust.
function b = TripFlip(a)
a = 'orange';
b = '';
for i = 2:2:length(a)
b=[b a(i) a(i-1)]
end