Generate a random string in MATLAB

I am trying to generate an array of strings from a long predefined array of chars. Suppose
I have the following long string:
s = 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba'
I want to create a group of random strings based on the following rules:
each string should be between 4 and 12 chars long and should start or end
with one of the following chars: {j,q,v,f,x,g,b,d,z}

Here is a solution that returns all strings which fulfill the following rules:
1. The starting and ending chars have to be taken from the string
start_end_char = 'jqvfxgbdz';
2. The length has to be between 4 and 8 chars.
3. The string has to be sequentially correct, meaning the resulting
strings have to appear in exactly the same way in the "long" string.
So what am I doing?
First of all, I find all the positions where the predefined starting and ending chars appear in the main string (careful: I used s2 instead of s as the string name).
Then I get a sorted list of those positions (sorted_list).
Next, for each element I get a list of indices of acceptable ending chars (following rules 1 and 2 stated above). These are saved in helper, which has to be a cell array because the lists differ in length.
Last but not least, I construct all those strings and save them in resulting_strings, which also has to be a cell array.
s2 = 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba';
start_end_char = 'jqvfxgbdz';
length_start = length(start_end_char);
%% finding all positions of possible starting/ending points
position_char = cell(1, length_start);
for k = 1:length_start
    position_char{k} = find(s2 == start_end_char(k));
end
list_of_start_end_points = [];
%% getting an array with all starting/ending points in the given array
for k = 1:length_start
    list_of_start_end_points = horzcat(list_of_start_end_points, position_char{k});
end
sorted_list = sort(list_of_start_end_points);
%% getting possible combinations
helper = cell(1, length(sorted_list));
length_helper = 0; % has to start at 0, because [] + n stays empty
for k = 1:length(sorted_list)
    % an index distance of 3..7 corresponds to a string length of 4..8
    helper{k} = find(and(sorted_list-sorted_list(k) >= 3, sorted_list-sorted_list(k) <= 7));
    length_helper = length_helper + length(helper{k});
end
resulting_strings = cell(1, length_helper);
l = 1;
for k = 1:length(sorted_list)
    for m = 1:length(helper{k})
        resulting_strings{1,l} = s2(sorted_list(k):sorted_list(helper{k}(m)));
        l = l+1;
    end
end
This solution uses quite a few loops. The first two loops are negligible (their number of iterations equals the number of acceptable start/ending letters), but the latter two loops can be quite time consuming if the original string is much longer. So maybe someone will find a vectorized solution for the latter loops.
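One possible way to vectorize the latter loops is sketched below. It is only a sketch: the short string and the three start/end chars are made-up toy inputs, and the limits assume rule 2 above (lengths of 4 to 8 chars). It builds a matrix of all pairwise distances between candidate positions with bsxfun and uses find to pick the admissible start/end pairs. The last line draws one of the resulting strings at random with randi, since the question asked for random strings.

```matlab
% Toy input (an assumption for illustration, not the string from the question).
s2 = 'jaaqaaab';
start_end_char = 'jqb';

% Positions of all allowed start/end chars, already in ascending order.
sorted_list = find(ismember(s2, start_end_char));

% D(i,j) = sorted_list(j) - sorted_list(i): distance from start i to end j.
D = bsxfun(@minus, sorted_list, sorted_list.');

% An index distance of 3..7 corresponds to a string length of 4..8.
[i_start, i_end] = find(D >= 3 & D <= 7);

% Cut out every admissible substring (arrayfun still loops internally).
resulting_strings = arrayfun(@(a, b) s2(sorted_list(a):sorted_list(b)), ...
    i_start, i_end, 'UniformOutput', false);

% Pick one of the candidates at random.
random_string = resulting_strings{randi(numel(resulting_strings))};
```

Note that the distance matrix grows quadratically with the number of candidate positions, so for very long strings the memory cost may outweigh the speed gain.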

Related

Build a matrix starting from instances of structure fields in MATLAB

I'm really sorry to bother so I hope it is not a silly or repetitive question.
I have been scraping a website, saving the results as a collection in MongoDB, exporting it as a JSON file and importing it in MATLAB.
At the end of the story I obtained a struct object organised like this one in the picture.
What I'm interested in are the last two cell arrays (which can be easily converted to string arrays with string()). The first cell array is a collection of keys (think unique products) and the second cell array is a collection of values (think prices), like a dictionary. Each field is an instance of possible values for a set of these keys (think daily prices). My goal is to build a matrix like this:
KEYS VALUES_OF_FIELD_1 VALUES_OF_FIELD2 ... VALUES_OF_FIELDn
A x x x
B x z NaN
C z x y
D NaN y x
E y x z
The main problem is that, as shown in the image and as I tried to explain in the example matrix, I don't always have a value for all the keys in every field (as you can see sometimes they are 321, other times 319 or 320 or 317) and so the key is missing from the first array. In that case I should fill the missing value with a NaN. The keys can be ordered alphabetically and are all unique.
What would you think would be the best and most scalable way to approach this problem in MATLAB?
Thank you very much for your time, I hope I explained myself clearly.
EDIT:
Both arrays are made of strings in my case, so types are not a problem (I've modified the example). The main problem is that, since the keys vary in each field, firstly I have to find all the (unique) keys in the structure, to build the rows, and then for each column (field) I have to fill the values putting NaN where the key is missing.
One thing to remember: you can't simply mix strings and numbers in one matrix. If you combine them, they have to be either all strings or all numbers. I think all strings will work for you.
Before building the matrix, make sure that all the cell arrays have the same number of elements.
new_matrix = horzcat(keys,values1,...valuesn);
This will provide a matrix for each row (according to your image). Now you can use a for loop to get the matrices for all the rows.
For now, I've solved it by considering the longest array of keys in the structure as the complete set of keys, let's call it keys_set.
Then I've created for each field in the structure a Map object in this way:
for i=1:length(structure)
    structure(i).myMap = containers.Map(structure(i).key_field, structure(i).value_field);
end
Then I've built my matrix (M) by checking every map against the keys_set array:
for i=1:length(keys_set)
    for j=1:length(structure)
        if isKey(structure(j).myMap,char(keys_set(i)))
            M(i,j) = string(structure(j).myMap(char(keys_set(i))));
        else
            M(i,j) = string('MISSING');
        end
    end
end
This works, but it would be ideal to also be able to check that keys_set is really complete.
EDIT: I've solved my problem by using this function and building the correct set of all the possible keys:
%% Finding the maximum number of keys in all the fields
maxnk = length(structure(1).key_field);
for i=2:length(structure)
    if length(structure(i).key_field) > maxnk
        maxnk = length(structure(i).key_field);
    end
end
%% Initializing the matrix containing all the possible sets of keys
keys_set=string(zeros(maxnk,length(structure)));
%% Filling the matrix, padding with "0" where a field has fewer keys
for i=1:length(structure)
    d = length(string(structure(i).key_field));
    if d == maxnk
        keys_set(:,i) = string(structure(i).key_field);
    else
        clear tmp
        tmp = [string(structure(i).key_field); string(zeros(maxnk-d,1))];
        keys_set(:,i) = tmp;
    end
end
%% Merging without duplication and removing the "0" element
% union_several is the external function mentioned above, not a MATLAB built-in
keys_set = union_several(keys_set);
keys_set = keys_set(keys_set ~= string(0));
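As an alternative sketch that needs no padding and no external function, the complete key set can probably be built with unique, and each column can be filled with ismember. The sample data below is hypothetical, and the sketch assumes that key_field and value_field are column cell arrays of char vectors; if they are row vectors, horzcat would be needed instead of vertcat.

```matlab
% Hypothetical sample data: two "days", the second one lacks key 'B'.
structure(1).key_field   = {'A'; 'B'; 'C'};
structure(1).value_field = {'x'; 'y'; 'z'};
structure(2).key_field   = {'A'; 'C'};
structure(2).value_field = {'p'; 'q'};

% All unique keys across every struct element, sorted alphabetically.
keys_set = unique(vertcat(structure.key_field));

% Pre-fill with a placeholder, then write each column in one step.
M = repmat({'MISSING'}, numel(keys_set), numel(structure));
for j = 1:numel(structure)
    % row(k) is the row of keys_set that holds the k-th key of element j
    [found, row] = ismember(structure(j).key_field, keys_set);
    M(row(found), j) = structure(j).value_field(found);
end
```

Because keys_set is built from every element, it is complete by construction, which addresses the concern about relying on the longest key array.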

Search for an exact match in string

Given a table with the following format in MATLAB:
itemids keywords
1 3D,children,anim,pixar,3D,3D pixar
2 3D,4D pixar,3D car
... ...
I want to count the number of times each keyword is repeated in each item. The full list of unique keywords is available in keywords = {'3D';'Children';'anim';'pixar' ...}. The output is a matrix TF with rows equal to the number of items and columns equal to length(keywords).
One of the difficulties here is to search for an exact match for each string. I am currently using strcmp(), which seems to return all the entries containing a given word rather than only exact matches. In my case I need to differentiate between 3D and 3D pixar.
This can be done using the ismember function in MATLAB. I am assuming that the keywords for each item are actually stored as a single string, in which case you will need to split the keywords before calling ismember.
relevantKeywords = {'3D','Children','anim','pixar'};
keywordsInItem = strtrim(strsplit(keywordsStr,',')); % keywordsStr is the keyword string of one item; split it and trim each word
tmp = ismember(relevantKeywords,keywordsInItem);
tmp will be of size 1 x length(relevantKeywords) indicating if the relevant keyword was found.
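Since the question asks for counts rather than only presence, the same split can be combined with strcmp, which compares whole strings once the keywords are separated. The following is a minimal sketch for a single item. The variable names are illustrative, and the keyword list is written in the same letter case as the item text because strcmp is case-sensitive (strcmpi could be used instead).

```matlab
keywordsStr = '3D,children,anim,pixar,3D,3D pixar';   % first item from the question
keywords    = {'3D', 'children', 'anim', 'pixar'};    % assumed keyword list

% Split the item's string and trim each keyword.
keywordsInItem = strtrim(strsplit(keywordsStr, ','));

% For each keyword, count the exact matches among the item's keywords.
% '3D pixar' matches neither '3D' nor 'pixar'.
counts = cellfun(@(k) sum(strcmp(k, keywordsInItem)), keywords);
```

Running this once per item and stacking the counts vectors as rows would give the TF matrix described in the question.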

Dimensions of matrices being concatenated are not consistent using array with characters

I'm trying to initialize
labels =['dh';'Dh';'gj';'Gj';'ll';'Ll';'nj';'Nj';'rr';'Rr';'sh';'Sh';'th';'Th';'xh';'Xh';'zh';'Zh';'ç';'Ç';'ë';'Ë'];
But it shows me the error in the title. When I try with numbers it all works perfectly, but not with characters. What could be the problem?
If you wish to eliminate any padding, you can also store it into a cell as follows.
labels = {'dh';'Dh';'gj';'Gj';
'll';'Ll';'nj';'Nj';
'rr';'Rr';'sh';'Sh';
'th';'Th';'xh';'Xh';
'zh';'Zh';'ç';'Ç';
'ë';'Ë'};
Then you can reference the "i"th element using labels{i} instead of labels(i,:) which is simpler. You can further run more string operations using cellfun and not interfere with any existing values that you've stored.
I agree with krisdestruction that using a cell array makes the code accessing the strings simpler and is generally more idiomatic. That is what I would also recommend unless there is a compelling reason to do something else.
For completeness, you could use the char function to add the padding automatically for you if you really want a character array:
>> char('aa','bb','c')
ans =
aa
bb
c
where the last row is 'c '. From the char documentation:
S = char(A1,...,AN) converts the arrays A1,...,AN into a single character array. After conversion to characters, the input arrays become rows in S. Each row is automatically padded with blanks as needed. An empty string becomes a row of blanks.
(Emphasis mine)
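The two representations can be converted into each other, as the following small sketch suggests. The three labels are a shortened, ASCII-only sample chosen for illustration.

```matlab
labels = {'dh'; 'Dh'; 'c'};          % hypothetical sample labels of unequal length

lens   = cellfun(@numel, labels);    % length of each label
padded = char(labels);               % 3x2 char array, short rows padded with blanks
back   = cellstr(padded);            % cell array again, trailing blanks removed
```

Here padded(3,:) is 'c ' with a trailing blank, while back{3} is 'c' again.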
From the Mathworks documentation:
Apply the MATLAB concatenation operator, []. Separate each row with a semicolon (;). Each row must contain the same number of characters. For example, combine three strings of equal length:
You can try padding like this to make every row 2 characters:
labels = ['dh';'Dh';'gj';'Gj';
'll';'Ll';'nj';'Nj';
'rr';'Rr';'sh';'Sh';
'th';'Th';'xh';'Xh';
'zh';'Zh';'ç ';'Ç ';
'ë ';'Ë '];

How can i write a number values in powers of 10? [duplicate]

How can I write a number/Integer value to power of 10, e.g. 1000 as 10^3? I am writing code whose output is a string of very large numbers. My output in longEng format is:
4.40710646596169e+018
16.9749211806197e+186
142.220634811050e+078
508.723835280617e+204
1.15401317731033e-177
129.994388899690e+168
14.3008811642810e+153
1.25899227268954e+165
24.1450064703939e+150
627.108997290435e+144
2.03728822649372e+177
339.903986115177e-066
150.360900017430e+183
5.39003779219462e+135
183.893417489826e+198
648.544709490386e+045
19.7574461055182e+198
3.91455750674308e+102
6.41548629454028e-114
70.4943280639616e+096
19.7574461055182e+198
3.11450571506133e-009
249.857950606210e+093
4.64921904682151e+180
750.343029004712e+147
I want these results to be in a format of power of 10, so that I can easily do arithmetic operations for my next function.
You can write format shortE and see your output like this:
4.4071e+18
1.6975e+187
1.4222e+80
5.0872e+206
If you only want to print the data in scientific format, MATLAB itself can do this for you.
If you want to obtain the scientific notation form
a * 10^b,
i.e., obtain the coefficient a and the exponent b, you can first obtain b as:
b = floor(log10(abs(x)));
and then a as:
a = x * 10^(-b);
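A short worked example of these two formulas, applied elementwise to two of the values from the question, might look like this. It uses x ./ 10.^b, which is the same computation written for vectors. The sketch assumes x contains no zeros, since log10(0) is -Inf.

```matlab
x = [4.40710646596169e+018, 339.903986115177e-066];  % two values from the question

b = floor(log10(abs(x)));   % exponents
a = x ./ 10.^b;             % coefficients, with 1 <= |a| < 10

% Print each value as a * 10^b.
for k = 1:numel(x)
    fprintf('%.14g = %.14f * 10^%d\n', x(k), a(k), b(k));
end
```

For the first value this gives b = 18, and for the second b = -64, because 339.9e-066 is 3.399e-064.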
From my understanding, you wish to take your number, e.g. 4.40710646596169e+018, and split it up into
4.40710646596169 and 018. Once you have them separated, you can perform operations as you wish.
You can even join them back to look like 4.40710646596169^018 if you so desire (although to look like that they would have to be strings, and therefore mathematical operations on them would give NaN).
Since e stands for "times 10 to the power of" and is present in all the numbers you listed, this is a simple process with many solutions. Here is one.
% format long only changes how numbers are displayed. Without it, it can look
% as if precision was lost: MATLAB hides digits to save screen space, but the
% stored value is unchanged.
format long
x = 4.40710646596169e+018;
% convert the number into a string in exponential notation, so that the 'e' is
% always present and can be used like a delimiter. (num2str(x) may print an
% integer-valued number like this one without any exponent.)
s = sprintf('%.14e', x);
% use strsplit to perform the split at the 'e'. It will output a 1x2 cell.
% Splitting at 'e' instead of 'e+' also handles negative exponents.
D = strsplit(s, 'e');
% extract each cell to a separate variable. In fact D{1} can be used directly
% as the input of the next function.
D11 = D{1};
D22 = D{2};
% convert the separated strings back into double-precision numbers
D1 = str2double(D11)
D2 = str2double(D22)
To do this operation on an entire column vector, it is simply a matter of using a for loop to iterate through all the numbers you have.

how to find all the possible longest common subsequence from the same position

I am trying to find all the possible longest common subsequences from the same position of multiple fixed-length strings (there are 700 strings in total, and each string has 25 letters). A longest common subsequence must contain at least 3 letters and belong to at least 3 strings. So if I have:
String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";
I need the answer to be:
String[] Answer = ["abd", "dpq"];
My one problem is that this needs to be as fast as possible. I tried to find the answer with a suffix tree, but the suffix tree method gives ["ab","pq"]. A suffix tree can only find contiguous substrings shared by multiple strings. The usual longest common subsequence algorithm cannot solve this problem either.
Does anyone have any idea on how to solve this with low time cost?
Thanks
I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.
Here is my suggestion: convert this into a graph problem. For each position in the string, you create a set of nodes (one for each unique letter at that position amongst all the strings in your collection, so 700 nodes if all 700 strings differ in the same position). Once you have created all the nodes for each position in the string, you go through your set of strings and check, for each pair of positions, which letter combinations are shared by at least 3 strings. In your example we would look first at positions 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge between the node "a" in the first set of nodes of the graph and the node "b" in the second set of nodes. You do this for each pair of positions and all combinations of letters in those two positions, until you have added all necessary links.
Once you have your final graph, you must look for the longest path; I recommend looking at the wikipedia article here: Longest Path. In our case we will have a directed acyclic graph and you can solve it in linear time! The preprocessing should be quadratic in the number of string positions since I imagine your alphabet is of fixed size.
P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)
You may try to use hashing.
Each string has at most 25 characters. It means that it has 2^25 subsequences. You take each string, calculate all 2^25 hashes. Then you join all the hashes for all strings and calculate which of them are contained at least 3 times.
In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer> where subsequence_pointer determines the subsequence of that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
Based on this algorithm, the program in the worst case (700 strings, 25 characters each) will run for a few minutes.
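The counting idea can be sketched on the four 7-letter strings from the question. The sketch is written in MATLAB for consistency with the rest of this page, and it replaces the hash with a plain text key (position mask plus letters) stored in a containers.Map, so it is an illustration of the idea rather than the answerer's implementation. With 25 letters per string the inner loop would run 2^25 times per string, so this form is only practical for short strings.

```matlab
strs = {'abcdeug', 'abxdopq', 'abydnpq', 'hzsdwpq'};   % strings from the question
n = numel(strs{1});

% Map from 'mask:letters' to the number of strings containing those letters
% at exactly the positions encoded in mask.
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for s = 1:numel(strs)
    for mask = 1:2^n - 1
        pos = find(bitget(mask, 1:n));     % positions selected by this mask
        if numel(pos) < 3                  % need at least 3 letters
            continue;
        end
        key = sprintf('%d:%s', mask, strs{s}(pos));
        if isKey(counts, key)
            counts(key) = counts(key) + 1;
        else
            counts(key) = 1;
        end
    end
end

% Keep the subsequences shared by at least 3 strings and strip the mask prefix.
allKeys = keys(counts);
common  = allKeys(cell2mat(values(counts)) >= 3);
answer  = sort(regexprep(common, '^\d+:', ''));
```

For this input, answer contains 'abd' and 'dpq'. In general a further step would be needed to discard subsequences that are contained in a longer one with the same support; that step is omitted here because no such case occurs in the example.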