How to find all the longest common subsequences at the same positions - suffix-tree

I am trying to find all the longest common subsequences that occur at the same positions in multiple fixed-length strings (there are 700 strings in total, each 25 letters long). A longest common subsequence must contain at least 3 letters and occur in at least 3 strings. So if I have:
String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";
I need the answer to be:
String[] Answer = {"abd", "dpq"};
My one problem is that this needs to be as fast as possible. I tried to find the answer with a suffix tree, but the suffix tree method gives ["ab", "pq"], because a suffix tree can only find contiguous substrings shared by multiple strings. The standard longest common subsequence algorithm cannot solve this problem either.
Does anyone have any idea how to solve this at low time cost?
Thanks

I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.
Here is my suggestion: convert this into a graph problem. For each position in the string you create a set of nodes, one for each unique letter at that position among all the strings in your collection (so 700 nodes if all 700 strings differ at the same position). Once you have created the nodes for every position, you go through your set of strings and count how often each pair of letters occurs together at each pair of positions. In your example we would look first at positions 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge from the node "a" in the first group of nodes to the node "b" in the second group. You do this for every pair of positions and every combination of letters at those two positions, adding an edge whenever at least 3 strings share the pair, until you have added all necessary links.
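Once you have your final graph, you must look for the longest path; I recommend looking at the Wikipedia article on the Longest Path problem. In our case the graph is a directed acyclic graph, so the longest path can be found in time linear in the size of the graph. The preprocessing should be quadratic in the number of string positions, since I imagine your alphabet is of fixed size. Note that each edge on a path may be supported by a different group of strings, so a candidate path should still be checked against the original strings.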
P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)

You may try to use hashing.
Each string has at most 25 characters, so it has at most 2^25 subsequences (one for each subset of positions). You take each string and calculate all 2^25 hashes. Then you combine the hashes of all strings and find which of them occur at least 3 times.
In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer>, where subsequence_pointer identifies the subsequence behind that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
With this algorithm, the program will run for a few minutes in the worst case (700 strings, 25 characters each).
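The counting idea above can be sketched in Python, where a dictionary does the hashing. This is only an illustrative sketch under some assumptions: the function name and parameters are made up here, the key is the pair (positions, letters) so that matches must sit at the same positions, and nothing is optimized, so as written it is only practical for short strings such as the four from the question.

```python
from collections import defaultdict
from itertools import combinations


def common_positional_subsequences(strings, min_len=3, min_support=3):
    """Return the longest same-position subsequences shared by >= min_support strings."""
    length = len(strings[0])
    support = defaultdict(int)
    # Enumerate every choice of positions; the dict hashes (positions, letters) keys.
    for size in range(min_len, length + 1):
        for positions in combinations(range(length), size):
            for s in strings:
                letters = "".join(s[i] for i in positions)
                support[(positions, letters)] += 1
    # Keep only the patterns that enough strings agree on.
    frequent = [key for key, count in support.items() if count >= min_support]
    if not frequent:
        return []
    best = max(len(positions) for positions, _ in frequent)
    return sorted({letters for positions, letters in frequent if len(positions) == best})


tests = ["abcdeug", "abxdopq", "abydnpq", "hzsdwpq"]
print(common_positional_subsequences(tests))  # ['abd', 'dpq']
```

For the four example strings this reproduces the answer asked for in the question.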

Related

Append an atom to existing variables and create a new set in clingo

I am totally new to ASP. I am learning clingo and I have a problem with variables. I am working on graphs and paths in graphs, so I used a tuple such as g((1,2,3)). What I want is to add a new node to the path that the tuple holds. For instance, the code below gives me (0,(1,2,3)), but what I want is (0,1,2,3).
Thanks in advance.
g((1,2,3)).
g((0,X)):-g(X).
Naive fix:
g((0,X,Y,Z)) :- g((X,Y,Z)).
However, I sense that you want to store the path in the tuple as if it were a list. Bad news: unlike Prolog, clingo isn't meant to handle lists as terms of atoms (as your example does). Lists are handled by indexing the elements; for example, the list [a,b,c] would be stored in predicates like p(1,a). p(2,b). p(3,c). Why? Because of grounding: you aim for a small ground program to reduce the complexity of the solving process. To put it in numbers: assume you are searching for a path which includes all n nodes. There are n! candidate orderings. For n=10 that is 3628800 potential paths, introducing 3628800 predicates for a comparatively small graph. Numbering the nodes as mentioned leads to only n*n potential predicates to represent the path. For n=10 that is just 100, a huge gain compared with 3628800.
To get an impression what you are searching for, run the following example derived from the potassco website:
% generating path: for every time exactly one node
{ path(T,X) : node(X) } = 1 :- T=1..6.
% one node isn't allowed on two different positions
:- path(T1,X), path(T2,X), T1!=T2.
% there has to be an edge between 2 adjacent positions
:- path(T,X), path(T+1,Y), not edge(X,Y).
#show path/2.
% Nodes
node(1..6).
% (Directed) Edges
edge(1,(2;3;4)). edge(2,(4;5;6)). edge(3,(1;4;5)).
edge(4,(1;2)). edge(5,(3;4;6)). edge(6,(2;3;5)).
Output:
Answer: 1
path(1,1) path(2,3) path(3,4) path(4,2) path(5,5) path(6,6)
Answer: 2
path(1,1) path(2,3) path(3,5) path(4,4) path(5,2) path(6,6)
Answer: 3
path(1,6) path(2,2) path(3,5) path(4,3) path(5,4) path(6,1)
Answer: 4
path(1,1) path(2,4) path(3,2) path(4,5) path(5,6) path(6,3)
Answer: 5
...
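For comparison, the same search can be written as a brute-force check in Python. This is a sketch and not a replacement for the clingo encoding: it tries all n! orderings of the nodes, which is exactly the blow-up the indexed path(T,X) representation avoids. The edge table is copied from the facts above; the function name is made up for illustration.

```python
from itertools import permutations

# Directed edges copied from the clingo facts above.
EDGES = {
    1: {2, 3, 4}, 2: {4, 5, 6}, 3: {1, 4, 5},
    4: {1, 2}, 5: {3, 4, 6}, 6: {2, 3, 5},
}


def hamiltonian_paths(edges):
    """Return every ordering of the nodes in which consecutive nodes are joined by an edge."""
    nodes = sorted(edges)
    return [
        order
        for order in permutations(nodes)
        if all(b in edges[a] for a, b in zip(order, order[1:]))
    ]


paths = hamiltonian_paths(EDGES)
# Check that the first answer listed above is among the orderings found.
print((1, 3, 4, 2, 5, 6) in paths)
```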

Using the bijection rule to count binary strings with even parity

Question:
Let B = {0, 1}. B^n is the set of binary strings with n bits. Define the set E^n to be the set of binary strings with n bits that have an even number of 1's. Note that zero is an even number, so a string with zero 1's (i.e., a string that is all 0's) has an even number of 1's.
(a)
Show a bijection between B^9 and E^10. Explain why your function is a bijection.
(b)
What is |E^10|?
I am having trouble finding a function that maps into the set and is a bijection. How do I approach solving this problem?
Does it have something to do with cases? For example, if a string in B^9 has an even number of 1's, append a 0, and if it has an odd number of 1's, append a 1, to obtain a string in E^10?
Thanks!
(a) Every string in E^10 begins with a prefix of length nine which is also a member of B^9. Given the prefix of length nine, the last bit is uniquely determined, since it must be 0 if the prefix is in E^9 and 1 if the prefix is not in E^9. Therefore, each element of E^10 corresponds to exactly one element of B^9. Similarly, for any element of B^9, an element of E^10 can be uniquely formed by appending either a 0 or a 1 (choosing the one that results in even parity). This operation, appending either 0 or 1 to create even parity, maps each element of B^9 to a unique element of E^10. Because every element of E^10 maps to a unique element of B^9, and every element of B^9 maps to a unique element of E^10, we have our bijection.
(b) Because there is a bijection between B^9 and E^10, we know |E^10| = |B^9|. But |B^9| = 2^9, since for each of the nine positions in any string in B^9 we can independently choose one of two values for the bit. Therefore, |E^10| = 2^9 also.
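The argument can be checked by brute force. The sketch below (the function name is only illustrative) appends the parity bit to every string in B^9 and confirms that the resulting map is injective, hits all of E^10, and that |E^10| = 2^9 = 512.

```python
from itertools import product


def append_parity_bit(bits):
    """Map a string in B^n to E^(n+1) by appending the bit that makes the count of 1's even."""
    return bits + str(bits.count("1") % 2)


b9 = ["".join(p) for p in product("01", repeat=9)]
e10 = [s for s in ("".join(p) for p in product("01", repeat=10)) if s.count("1") % 2 == 0]

images = [append_parity_bit(s) for s in b9]
is_injective = len(set(images)) == len(b9)   # no two inputs share an image
is_surjective = set(images) == set(e10)      # every element of E^10 is hit
print(is_injective, is_surjective, len(e10))  # True True 512
```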

Using Matlab to find output at specified input values

spacing_Pin = transpose(-27:0.0001:2);
thetah_2nd = Phi_intrp3(ismembertol(spacing_Pin,P_in2nd));
With this code, I want to evaluate Phi_intrp3 at indices where spacing_Pin is equal to P_in2nd.
I know I have asked similar questions before, and I have got some really helpful answers already. But in this case they do not seem to apply. P_in2nd has only 40 entries, whereas spacing_Pin has far more. Therefore I cannot consider the absolute value of the difference of spacing_Pin and P_in2nd to find out where they are closest to equal.
So P_in2nd has values between -25.9747 and -0.0147. The decimals have 4 digits after the dot, but these are sometimes rounded by Matlab (format short). That's the catch, I think: P_in2nd is not found in spacing_Pin. The result is an empty matrix.
Here's the first 5 entries of P_in2nd:
-25,9747431735299
-24,9747431735299
-23,9947431735299
-23,0047431735299
-22,0047431735299
Now, I want to evaluate Phi_intrp3 at these values. For this purpose I can change spacing_Pin, but not P_in2nd. For example, when I search for the first entry of P_in2nd in spacing_Pin, I find that entry 10254 = -25,9747000000000. So I want to evaluate Phi_intrp3 at this input entry.
Is there a way of doing this?
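One possible approach, sketched here in Python rather than MATLAB and offered only as an illustration: because spacing_Pin is a regular grid that starts at -27 with step 0.0001, the index of the grid point nearest to a value can be computed directly instead of searched for. The function name and its defaults are assumptions taken from the grid definition in the question.

```python
def nearest_grid_index(value, start=-27.0, step=0.0001):
    """Return the 0-based index of the grid point start + k*step closest to value."""
    return round((value - start) / step)


p_in2nd = [-25.9747431735299, -24.9747431735299, -23.9947431735299]
indices = [nearest_grid_index(p) for p in p_in2nd]
# MATLAB indexing is 1-based, so add 1 when comparing with MATLAB entry numbers.
print([i + 1 for i in indices])
```

For the first entry this gives the 1-based index 10254, which agrees with the entry found by hand in the question. Phi_intrp3 could then be read at those indices.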

Genetic-algorithm encoding

I am trying to create an algorithm which I believe is similar to a knapsack-problem. The problem is to find recipes/Bill-of-Materials for certain intermediate products. There are different alternatives of recipes for the intermediate products. For example product X can either consist of 25 % raw material A + 75 % raw material B, or 50 % of raw material A + 50 % raw material B, etc. There are between 1 to 100 different alternatives for each recipe.
My question is how best to encode the different recipe alternatives (and/or where to find similar problems on the internet). I think I have to use value encoding, i.e. assign a value to each alternative of a recipe. Are there other reasonable options?
Thanks & kind regards
You can encode the problem with a number chromosome. If your product has N ingredients, then your number chromosome has length N: X={x1,x2,..,xN}. Every number xi of the chromosome represents the parts of ingredient i. The numbers are not required to sum to one.
E.g. X={23,5,0} means, you need 23 parts of ingredient 1, 5 parts of ingredient 2 and zero parts of ingredient 3.
With this encoding, crossover will not invalidate the chromosome.
You can use a 100-dimensional variable to represent an individual, as below:
X={x1,x2,x3,...,x100} xi∈[0,1] ∑(xi)=1.0
It is hard to use a crossover operation on this encoding, so I suggest that offspring be produced by mutation only.
Mutation operation on a parent individual 'X':
(1)randomly choose two dimensions 'xi' and 'xj' from 'X';
(2)p=rand(0,1);
(3)xj=xj+(1-p)*xi;
(4)xi=xi*p;
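A minimal sketch of this mutation in Python, assuming the individual is stored as a list of floats (the function name and the rng parameter are illustrative). Steps (3) and (4) move the fraction (1-p) of xi over to xj, so the sum of the components is unchanged and every component stays in [0,1].

```python
import random


def mutate(x, rng=random):
    """Move a random fraction of one component to another; the total stays unchanged."""
    child = list(x)
    i, j = rng.sample(range(len(child)), 2)   # step (1): two distinct dimensions
    p = rng.random()                          # step (2): p in [0, 1)
    child[j] += (1 - p) * child[i]            # step (3)
    child[i] *= p                             # step (4)
    return child


parent = [0.25, 0.75, 0.0]
child = mutate(parent, random.Random(0))
print(child, sum(child))
```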

Perl - determining the intersection of several numeric ranges

I would like to be able to load a long list of positive integer ranges and create a new "summary" range list that is the union of the intersections of each pair of ranges. And I want to do this in Perl. For example:
Sample ranges: (1..30) (45..90) (15..34) (92..100)
Intersection of ranges: (15..30)
The only way I could think of was using a bunch of nested if statements to determine the starting point of sample A, sample B, sample C, etc. and figure out the overlap that way, but that is not feasible with hundreds of samples, each containing numerous ranges.
Any suggestions are appreciated!
The first thing you should do when you need to do something is take a look at CPAN to see what tools are available, or whether someone has already solved your problem for you.
Set::IntSpan and Set::IntRange are on the first page of results for "set" on CPAN.
What you want is the union of the intersection of each pair of ranges, so the algorithm is as follows:
Create an empty result set.
Create a set for each range.
For each set in the list,
    For each later set in the list,
        Find the intersection of those two sets.
        Find the union of the result set and this intersection. This is the new result set.
Enumerate the elements of the result set.
I don't have code to share, but I would expand each range into hash, or use a Set module, and then use intersection operations on the sets.
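The union-of-pairwise-intersections algorithm described above can be sketched as follows. The sketch is in Python rather than Perl and works on (start, end) pairs instead of expanded sets, which is an assumption made here to keep it short; the function name is illustrative.

```python
from itertools import combinations


def union_of_pairwise_intersections(ranges):
    """Return merged inclusive (start, end) ranges covered by at least two input ranges."""
    overlaps = []
    for (a_start, a_end), (b_start, b_end) in combinations(ranges, 2):
        start, end = max(a_start, b_start), min(a_end, b_end)
        if start <= end:                            # the pair overlaps
            overlaps.append((start, end))
    merged = []
    for start, end in sorted(overlaps):
        if merged and start <= merged[-1][1] + 1:   # overlapping or adjacent integers
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


samples = [(1, 30), (45, 90), (15, 34), (92, 100)]
print(union_of_pairwise_intersections(samples))  # [(15, 30)]
```

With the sample ranges from the question this yields the single range (15..30).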