Using the bijection rule to count binary strings with even parity - discrete-mathematics

Question:
Let B = {0, 1}. B^n is the set of binary strings with n bits. Define the set E^n to be the set of binary strings with n bits that have an even number of 1's. Note that zero is an even number, so a string with zero 1's (i.e., a string that is all 0's) has an even number of 1's.
(a)
Show a bijection between B^9 and E^10. Explain why your function is a bijection.
(b)
What is |E^10|?
I am having trouble finding a function that maps into the right set and is a bijection. How do I approach solving this problem?
Is it something to do with cases? For example, if the string in B^9 has an even number of 1's, append a 0, and if it has an odd number of 1's, append a 1, to obtain a string in E^10?
Thanks!

(a) Every string in E^10 begins with a prefix of length nine, which is a member of B^9. Given that prefix, the last bit is uniquely determined: it must be 0 if the prefix is in E^9 (it already has an even number of 1's) and 1 otherwise. Therefore each element of E^10 corresponds to exactly one element of B^9, namely its nine-bit prefix. Conversely, for any element of B^9, an element of E^10 can be uniquely formed by appending either a 0 or a 1 (whichever makes the total number of 1's even). This operation maps each element of B^9 to a unique element of E^10, and dropping the last bit undoes it. Because the two maps are inverses of each other, the function is a bijection.
(b) Because there is a bijection between B^9 and E^10, we know |E^10| = |B^9|. But |B^9| = 2^9, since for each of the nine positions in any string in B^9 we can independently choose one of two values for the bit. Therefore, |E^10| = 2^9 also.
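For strings this short, the argument can also be checked by brute force. The following is a minimal Python sketch; the names append_parity_bit and has_even_parity are illustrative and not part of the original question. It applies the "append the parity bit" map to every string in B^9 and compares the set of images with E^10.

```python
from itertools import product

def append_parity_bit(bits):
    """Append 0 or 1 so that the total number of 1's becomes even."""
    return bits + (sum(bits) % 2,)

def has_even_parity(bits):
    """True if the tuple of bits contains an even number of 1's."""
    return sum(bits) % 2 == 0

# All 9-bit strings, and all 10-bit strings with an even number of 1's.
B9 = list(product((0, 1), repeat=9))
E10 = {s for s in product((0, 1), repeat=10) if has_even_parity(s)}

# Images of B^9 under the map; a set drops duplicates, so equal sizes
# mean the map is injective, and equality with E10 means it is onto E^10.
images = {append_parity_bit(s) for s in B9}

print(len(B9), len(images), images == E10)
```

If the map is a bijection, the two sizes agree and the set of images equals E^10.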

Related

How can I create a hash function in which different permutations of the digits of an integer form the same key?

For example, 20986 and 96208 should generate the same key (but not 09862 or 9862, since a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest or largest sorted permutation and use the sorted number as the hash key, but sorting is too costly for my case. I need to generate the key in O(1) time.
Another idea I have is to traverse the number, get the frequency of each digit, and then build a hash function out of the frequencies. What is the best function to combine the frequencies, given that 0 <= Summation(f[i]) <= no_of_digits?
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.
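As a minimal sketch of the "hash each digit, then combine with a commutative function" idea, assuming non-negative Python integers as input: the table DIGIT_CODES, the fixed seed, and the name permutation_key are illustrative choices, not taken from the answer. The sketch combines with addition modulo 2^64 rather than XOR, because XOR-ing a digit's code with itself gives zero, so with XOR any number whose digits all occur an even number of times (for example 11 and 22) would collide.

```python
import random

# One fixed pseudo-random 64-bit code per decimal digit (illustrative values;
# the seed only makes the table reproducible between runs).
_rng = random.Random(12345)
DIGIT_CODES = [_rng.getrandbits(64) for _ in range(10)]
MASK = (1 << 64) - 1  # keep the key within 64 bits

def permutation_key(n: int) -> int:
    """Order-insensitive key: the sum of the digit codes modulo 2**64."""
    key = 0
    while True:
        n, digit = divmod(n, 10)
        key = (key + DIGIT_CODES[digit]) & MASK
        if n == 0:
            break
    return key

print(permutation_key(20986) == permutation_key(96208))
print(permutation_key(20986) == permutation_key(9862))
```

Because the input is an integer, leading zeros are already gone: 09862 is simply 9862, which has one digit fewer and therefore a different key. The loop runs once per digit, so the cost is constant for fixed-width integers.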

Maximum length of a string after performing Unicode casefolding

I need to perform casefolding on a set of strings, and must ensure beforehand that they will not exceed a given length after this is done (to hard-code the needed buffer size). The problem is that a string length (in code points) may change after casefolding is applied. See, e.g., in Python3:
>>> "süß".casefold()
'süss'
Now, the maximum number of code points a string may contain after performing casefolding can be computed easily:
>>> max(len(chr(s).casefold()) for s in range(0x10FFFF + 1))
3
But is it valid in all cases? I mean, is it possible that the sequence of code points (the order in which they appear) might affect the final length of the string, due to some arcane property of Unicode? Or can I assume that the final string will always be at most 3 times longer than the original?
The Unicode standard defines casefolding as follows:
toCasefold(X): Map each character C in X to Case_Folding(C).
So every character in a string is casefolded regardless of context and the results are concatenated. This means that your assumption is correct: A casefolded string is guaranteed to have at most three times the number of code points of the original.
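The per-character definition can be checked directly in Python 3. This is a small sketch, not a proof: casefold_per_char and max_casefolded_length are hypothetical helper names, and the factor 3 is the maximum computed in the question for the Unicode version shipped with that interpreter.

```python
def casefold_per_char(s: str) -> str:
    """Casefold each code point on its own and concatenate the results."""
    return "".join(ch.casefold() for ch in s)

def max_casefolded_length(n_code_points: int, expansion: int = 3) -> int:
    """Upper bound on the casefolded length, given the per-character maximum."""
    return expansion * n_code_points

samples = ["süß", "Straße", "ΟΔΥΣΣΕΥΣ", "\u0390\u0390", "Hello, World"]
for s in samples:
    folded = s.casefold()
    # Context-free: folding the whole string equals folding each character.
    print(folded == casefold_per_char(s),
          len(folded) <= max_casefolded_length(len(s)))
```

The buffer size should be re-derived if the Unicode version changes, since the per-character maximum is a property of the case-folding data rather than a fixed constant of the language.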

How to find all the possible longest common subsequences from the same position

I am trying to find all the possible longest common subsequences at the same positions across multiple fixed-length strings (there are 700 strings in total, and each string has 25 letters). A longest common subsequence must contain at least 3 letters and belong to at least 3 strings. So if I have:
String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";
I need the answer to be:
String[] Answer = ["abd", "dpq"];
My one problem is that this needs to be as fast as possible. I tried to find the answer with a suffix tree, but the suffix tree method gives ["ab", "pq"], because a suffix tree can only find contiguous substrings shared by multiple strings. The standard longest common subsequence algorithm cannot solve this problem either.
Does anyone have any idea on how to solve this with low time cost?
Thanks
I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.
Here is my suggestion: convert this into a graph problem. For each position in the string you create a set of nodes, one for each unique letter at that position among all the strings in your collection (so 700 nodes if all 700 strings differ at the same position). Once you have created all the nodes for each position, you go through your set of strings and count how often two positions share the same pair of letters; a pair qualifies when at least 3 strings contain it. In your example we would look first at positions 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge from the node "a" in the first set of nodes to the node "b" in the second set. You do this for every pair of positions and every combination of letters in those two positions until you have added all the necessary edges.
Once you have your final graph, you must look for the longest path; I recommend looking at the wikipedia article here: Longest Path. In our case we will have a directed acyclic graph and you can solve it in linear time! The preprocessing should be quadratic in the number of string positions since I imagine your alphabet is of fixed size.
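The edge-building step described above can be sketched in Python as follows. The function names and the min_support parameter are illustrative assumptions, and positions are 0-based here. One caveat worth making explicit: an edge only says that some three strings agree on two positions, so a path assembled from such edges still has to be checked against the strings themselves, which is what pattern_support does.

```python
from collections import Counter
from itertools import combinations

def build_edges(strings, min_support=3):
    """Map ((i, a), (j, b)) with i < j to the number of strings that have
    letter a at position i and letter b at position j, keeping only the
    pairs shared by at least min_support strings."""
    counts = Counter()
    for s in strings:
        for edge in combinations(enumerate(s), 2):
            counts[edge] += 1
    return {edge: c for edge, c in counts.items() if c >= min_support}

def pattern_support(strings, pattern):
    """Count the strings that match every {position: letter} in pattern."""
    return sum(all(s[i] == ch for i, ch in pattern.items()) for s in strings)

tests = ["abcdeug", "abxdopq", "abydnpq", "hzsdwpq"]
edges = build_edges(tests)
print(sorted(edges))

# "abd" and "dpq" are each shared by three strings ...
print(pattern_support(tests, {0: "a", 1: "b", 3: "d"}))
print(pattern_support(tests, {3: "d", 5: "p", 6: "q"}))
# ... but the path a-b-d-p-q, although every edge on it qualifies,
# is matched by only two strings.
print(pattern_support(tests, {0: "a", 1: "b", 3: "d", 5: "p", 6: "q"}))
```

For 700 strings of 25 letters this counts 700 * 300 letter pairs, so building the edges is cheap; the verification of candidate paths is where the remaining work lies.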
P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)
You may try to use hashing.
Each string has at most 25 characters. It means that it has 2^25 subsequences. You take each string, calculate all 2^25 hashes. Then you join all the hashes for all strings and calculate which of them are contained at least 3 times.
In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer> where subsequence_pointer determines the subsequence of that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
Based on the algo, the program in the worst case (700 strings, 25 characters each) will run for a few minutes.

Creating representable signature of tags

Let's say I have items with tags assigned to them, like "blue", "big", "flexible". Let's say I also have a dictionary of all possible tags.
Now the question is: how can I compress all the tags into a single small signature, let's say a floating point number? The requirement is that items with similar tags have similar signatures.
All the tags are known forever. The signature should be relatively small, e.g. a floating point number, or a set of few integers.
Frankly, I think the scheme of boiling this down to a single number isn't worth it. Just use a 16-bit int or a 32-bit int to represent a tag. And use one of those fields for each tag you want to apply to an item. Your quest to save space will just increase complexity unnecessarily.
Assign each tag an ID number. You may want to store the mapping of tags to IDs in a separate table. Call the total number of tags N and the number of tags a given item can have M. The tag signature would be the IDs as an M-digit base-N number.
So if N = 50k and M = 3
tag 1 = 49,999
tag 2 = 1
tag 3 = 2
tag signature = 49,999 + 1 * 50,000 ^ 1 + 2 * 50,000 ^ 2 = 5,000,099,999
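You'll need more than 32 bits to represent this. With N = 50,000 and M = 3 the largest signature is just under 50,000^3 = 1.25 * 10^14, which fits in 47 bits, so a 64-bit integer is enough here. For larger N or M, use a large enough integer type, or more than one integer if necessary. Don't use floats; you will lose precision.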
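A minimal Python sketch of this base-N packing, with hypothetical function names encode_tags and decode_tags. Python integers have arbitrary precision, so the width concern does not arise here; in a fixed-width language the result would have to fit the chosen integer type.

```python
def encode_tags(tag_ids, n_tags):
    """Pack tag IDs into one integer, treating them as base-n_tags digits
    (the first tag is the least significant digit)."""
    signature = 0
    for position, tag_id in enumerate(tag_ids):
        if not 0 <= tag_id < n_tags:
            raise ValueError(f"tag id {tag_id} is outside 0..{n_tags - 1}")
        signature += tag_id * n_tags ** position
    return signature

def decode_tags(signature, n_tags, m):
    """Recover the m tag IDs from a signature produced by encode_tags."""
    tag_ids = []
    for _ in range(m):
        signature, tag_id = divmod(signature, n_tags)
        tag_ids.append(tag_id)
    return tag_ids

signature = encode_tags([49_999, 1, 2], 50_000)
print(signature)                        # 5000099999, as in the example above
print(decode_tags(signature, 50_000, 3))
print(signature.bit_length())           # bits needed for this signature
```

The encoding is lossless and reversible, but numerically close signatures only indicate that the most significant tags match, so it does not by itself give the "similar tags, similar signature" property asked for in the question.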

Count the number of times a number is repeating in a vector

I have created a vector containing zeros and 1's using the following command in a for loop.
G(:,i)=rand(K,1)<rand;
Since this is part of a larger problem at a particular stage I need to count the number of 1's that are present in each column.
I have tried to find the count using a for loop which is very messy and takes too long.
I found that histc can be used for this but I get an error
histc(G(:,1),1)
First input must be non-sparse numeric array.
Is there a better way to do this, or am I missing something here?
If you have a matrix G containing zeroes and ones, and you want to know how many ones are in each column, all you need is SUM:
nOnes = sum(G, 1);  % number of ones in each column of G
This will give you a vector containing a total for each column in G.