what is the grammar G={V,L,S,P} behind the set of all strings containing an unequal number of 0s and 1s? - discrete-mathematics

Find a phrase-structure grammar for each of these languages.
g) the set of all strings containing an unequal number of 0s and 1s.

First, we can break this problem up by recognizing that an unequal number of 0s and 1s means that there are either more 0s, or more 1s. This suggests a grammar that can go either way:
S := R | T
R := (more 0s than 1s)
T := (more 1s than 0s)
The expressions for R and T should probably be pretty similar and just have the symbols reversed.
How can we guarantee there are more 0s than 1s, or vice versa? Well, we can insert at least one and maybe multiple 0s or 1s and then pad with strings that have the same number of 0s as 1s:
R := E0R | E0E
T := E1T | E1E
These will produce intermediate forms like E0E0E0E, E1E, etc. The idea here is that any string with exactly k more 0s than 1s (or 1s than 0s) can be written as k 0s (or 1s) separated by substrings with equal numbers of 0s and 1s. This seems reasonable but really should be proven (left as an exercise).
All that remains is to give productions for strings with the same number of 0s and 1s:
E := EE | 0E1 | 1E0 | e
To see this works, we can use induction on the length of a string with equal numbers of 0s and 1s. Base cases can include the shortest strings e, 01 and 10, which we can see pretty easily work. For the induction step, just note that there must be some shortest nonempty prefix with the numbers of 0s and 1s equal and that its first and last symbols must be different; if it's the whole string, then it can be obtained from a string of length two less by the productions 0E1 or 1E0; otherwise, it can be obtained by production EE on two shorter strings (the prefix and suffix are shorter and covered by the induction hypothesis).
The whole grammar turns out looking like:
S := R | T
R := E0R | E0E
T := E1T | E1E
E := EE | 0E1 | 1E0 | e
Is this the shortest, most efficient, most unambiguous, etc. grammar for this language? Who knows, but it should work!
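One way to gain confidence in the grammar is to brute-force it on short strings. The sketch below is a minimal recognizer, assuming T recurses on itself (T := E1T | E1E, mirroring R). The function names are made up for illustration. It compares the grammar against a direct count of 0s and 1s for every binary string up to a chosen length.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def derives_E(s):
    """True if s can be derived from E := EE | 0E1 | 1E0 | e."""
    if s == "":
        return True
    # E := 0E1 | 1E0 (outer symbols differ, inside is again an E)
    if len(s) >= 2 and s[0] != s[-1] and derives_E(s[1:-1]):
        return True
    # E := EE, trying only non-empty halves so the recursion terminates
    return any(derives_E(s[:i]) and derives_E(s[i:]) for i in range(1, len(s)))


@lru_cache(maxsize=None)
def derives_X(s, sym):
    """True if s can be derived from X := E sym X | E sym E (R uses '0', T uses '1')."""
    for i, ch in enumerate(s):
        if ch == sym and derives_E(s[:i]):
            rest = s[i + 1:]
            if derives_E(rest) or derives_X(rest, sym):
                return True
    return False


def derives_S(s):
    """S := R | T"""
    return derives_X(s, "0") or derives_X(s, "1")


def mismatches(max_len):
    """Strings where the grammar and a direct count disagree."""
    bad = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if derives_S(s) != (s.count("0") != s.count("1")):
                bad.append(s)
    return bad


print(mismatches(8))
```

An empty list only shows agreement on short strings; it is a sanity check, not a proof.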

Related

How can 3-state bits be packed together?

I am looking for a clever solution that would allow packing at least nine 3-state 'bits' into a 16-bit integer. It should also still be possible to easily set the value of one of these 3-state 'bits'.
As an example, it could be used to encode a tic-tac-toe position, the three states being _ (empty), X (me), O (opponent) for the nine squares of the board.
Naturally using 2 bits per square would do the job, but it would require 18 bits overall. Is there an encoding that would use only 1.7 bits at most per square, and still stay simple to work with?
You can store ten 3-state values in a 16-bit integer, since 3^10 = 59049 < 65536. Simply encode a 10-digit base-3 number into a 16-bit integer, and pull the digits out going the other way.
To encode each digit d, the repeated operation is n = 3*n + d. To decode the digits in the opposite order, the repeated operations are d = n % 3 and n /= 3.
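A minimal sketch of those two loops in Python, with hypothetical helper names. The get_cell and set_cell helpers are an addition that is not part of the answer above: they read or change a single digit without unpacking everything, with index 0 meaning the least significant (last encoded) digit.

```python
def pack(digits):
    """Encode a sequence of base-3 digits (each 0, 1 or 2) into one integer."""
    n = 0
    for d in digits:
        if d not in (0, 1, 2):
            raise ValueError("each digit must be 0, 1 or 2")
        n = 3 * n + d
    return n


def unpack(n, count):
    """Decode count digits; they come out last-first, so reverse at the end."""
    digits = []
    for _ in range(count):
        n, d = divmod(n, 3)
        digits.append(d)
    return digits[::-1]


def get_cell(n, i):
    """Digit at position i, counting from the least significant digit."""
    return (n // 3**i) % 3


def set_cell(n, i, value):
    """Return n with the digit at position i replaced by value."""
    return n + (value - get_cell(n, i)) * 3**i


# Tic-tac-toe board: 0 = empty, 1 = X, 2 = O
board = [0, 1, 2, 0, 0, 1, 2, 2, 1]
packed = pack(board)
print(packed, packed < 2**16, unpack(packed, 9) == board)
```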

how to get only positive results when applying hashCode()?

I am working on Scala code that converts a set of unique strings to unique IDs. I applied hashCode() but I got negative numbers, and I need to work only with positive numbers.
I know that I have to use math.abs to get rid of the negative values but I am not sure if this is the correct solution or not.
I read before that something like this could solve my problem:
math.abs(hashCode()) * constant % size
How can I determine the right constant? And does size mean the total number of strings?
Previous questions on this topic solved the problem by using math.abs only, but if the total number of strings is large an overflow could happen and there is still a chance of getting a negative number. Multiplying the result by a constant and taking the mod of size could help. This is why I need to understand how to determine the constant and the size.
Also, is there another way to get unique numbers for unique strings?
We can phrase your problem another way: How to get an unsigned number from a signed number with the same range?
Suppose you are using an Integer. Its value goes from -2147483648 to 2147483647. Now you need to convert this value into the positive range 0 to 2147483647.
Step 1:
ADD a constant to move the range upwards to 0. You can do this by adding 2147483648 to the value. But now the highest possible value is much greater than the MAX.
Step 2:
So use MODULO to move the value back into the required range.
For example, consider the values -2000 and 2000000000.
| STEP | MIN VALUE | EXAMPLE 1 | EXAMPLE 2 | MAX VALUE |
|-------------------|------------|------------|------------|------------|
| original |-2147483648 | -2000 | 2000000000 | 2147483647 |
| add 2147483648 | 0 | 2147481648 | 4147483648 | 4294967295 |
| modulo 2147483648 | 0 | 2147481648 | 2000000000 | 2147483647 |
So the final formula is:
(NUMBER + 2147483648) % 2147483648
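The two steps can be sketched in Python, where integers do not overflow. On the JVM the addition would have to be done in a wider type such as Long, since 2147483648 does not fit in an Int. The function name is hypothetical.

```python
RANGE = 2**31  # 2147483648


def to_non_negative(n):
    """Shift a signed 32-bit value up by 2**31, then reduce modulo 2**31."""
    if not -RANGE <= n < RANGE:
        raise ValueError("expected a signed 32-bit value")
    return (n + RANGE) % RANGE


for value in (-RANGE, -2000, 2000000000, RANGE - 1):
    print(value, "->", to_non_negative(value))
```

Because adding 2**31 and then reducing modulo 2**31 leaves the low 31 bits unchanged, this gives the same result as masking off the sign bit.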
Warning:
Hash codes are not designed to give unique values. There are chances of getting the same hash for two different strings. Also, any scaling operations on the hash (like division, modulo) can further reduce uniqueness.
To strip a sign from an Int, you can just use .abs. It does break on Int.MinValue, but you can just special case it:
def stripSign(n: Int) = math.abs(n) max 0
or simply drop the sign bit:
def stripSign2(n: Int) = n & Int.MaxValue
Or just use negative numbers (what's wrong with them anyway?).
To your other question, you cannot convert a bunch of unique strings to ints, and guarantee that there won't be duplications (for the simple reason that there are more strings than distinct Ints, so, if you wanted to assign an unique int to each of them, you'd run out of ints before you run out of strings), so you have to be able to handle collisions, however infrequent.
You can only shoot for lowering the probability of a collision by making your hash longer (with a 32-bit hash code, you have about 50% probability of at least one collision in a population of approximately 75000 strings, with 31 bits (if you do not want negative numbers), it is 55000, but with a 64-bit hash, the "magic number" is about 5 billion, provided that your hash function is good enough, and produces the numbers that are very evenly distributed).
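Those population sizes can be reproduced with the usual birthday approximation n ≈ sqrt(2 · N · ln 2) for N equally likely hash values. This is a rough sketch that assumes a perfectly uniform hash, and the function name is made up for illustration.

```python
import math


def population_for_half_collision(bits):
    """Approximate number of items giving a 50% chance of at least one collision."""
    return math.sqrt(2 * math.log(2) * 2**bits)


for bits in (31, 32, 64):
    print(bits, round(population_for_half_collision(bits)))
```

The results land close to the figures quoted above: in the tens of thousands for 31 and 32 bits, and around five billion for 64 bits.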

Hashing using division method

For the hash function : h(k) = k mod m;
I understand that m=2^n will always give the last n LSB digits. I also understand that m=2^p-1 when K is a string converted to integers using radix 2^p will give same hash value for every permutation of characters in K. But why exactly "a prime not too close to an exact power of 2" is a good choice? What if I choose 2^p - 2 or 2^p-3? Why are these choices considered bad?
Following is the text from CLRS:
"A prime not too close to an exact power of 2 is often a good choice for m. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly n = 2000 character strings, where a character has 8 bits.
We don’t mind examining an average of 3 elements in an unsuccessful search, and
so we allocate a hash table of size m = 701. We could choose m = 701 because
it is a prime near 2000/3 but not near any power of 2."
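Suppose we work with radix 2^p.
2^p-1 case:
Why is it a bad idea to use m = 2^p-1? Let us see. Write the key in radix 2^p with digits a_i:
k = Σ a_i * 2^(i*p)
Since 2^p ≡ 1 (mod 2^p-1), every power 2^(i*p) is also congruent to 1, so reducing modulo 2^p-1 gives
k = Σ a_i * 2^(i*p) ≡ Σ a_i (mod 2^p-1)
so, as addition is commutative, we can permute digits and get the same result.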
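A small check of the permutation effect, assuming p = 8 and ASCII strings read as radix-256 numbers. The helper names are hypothetical, and 701 is the table size from the CLRS example.

```python
def radix_value(s, p=8):
    """Interpret an ASCII string as a number written in radix 2**p."""
    k = 0
    for byte in s.encode("ascii"):
        k = (k << p) + byte
    return k


def division_hash(s, m, p=8):
    """h(k) = k mod m"""
    return radix_value(s, p) % m


for word in ("abc", "bca", "cba"):
    # m = 2**8 - 1 collapses every permutation to the digit sum mod 255
    print(word, division_hash(word, 255), division_hash(word, 701))
```

With m = 255 all three permutations hash to the same slot, while with m = 701 "abc" and "cba" land in different slots.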
2^p-b case:
Quote from CLRS:
A prime not too close to an exact power of 2 is often a good choice for m.
Since 2^p ≡ b (mod 2^p-b), we get
k = Σ a_i * 2^(i*p) ≡ Σ a_i * b^i (mod 2^p-b)
So changing the least significant digit by one will change the hash by one. Changing the second least significant digit by one will change the hash by b (by two when b = 2). To really change the hash we would need to change digits of higher significance. So, for small b we face a problem similar to the case where m is a power of 2: the hash depends mostly on the distribution of the least significant digits.
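The congruence can be checked numerically. This sketch assumes p = 8, ASCII strings as keys, and small values of b. The function names are made up for illustration.

```python
def digits(s):
    """Radix-2**8 digits a_0, a_1, ... of an ASCII string, least significant first."""
    return list(reversed(s.encode("ascii")))


def hash_direct(s, b, p=8):
    """k mod (2**p - b), with k built from the full radix-2**p expansion."""
    k = sum(a << (p * i) for i, a in enumerate(digits(s)))
    return k % (2**p - b)


def hash_reduced(s, b, p=8):
    """Same hash computed from the sum of a_i * b**i."""
    return sum(a * b**i for i, a in enumerate(digits(s))) % (2**p - b)


for b in (1, 2, 3):
    print(b, hash_direct("abc", b), hash_reduced("abc", b))

# Bumping the second least significant digit ('b' -> 'c') moves the hash by b
print(hash_direct("abc", 3), hash_direct("acc", 3))
```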

Set of unambiguous looking letters & numbers for user input

Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset that has fewer characters that are visually ambiguous, and by removing (or equating) certain characters we reduce human error?
I know "visually ambiguous" is somewhat waffly of an expression, but it is fairly evident that D, O and 0 are all similar, and 1 and I are also similar. I would like to maximize the size of the set of alpha-numerics, but minimize the number of characters that are likely to be misinterpreted.
The only precedent I am aware of for such a set is the Canada Postal code system that removes the letters D, F, I, O, Q, and U, and that subset was created to aid the postal system's OCR process.
My initial thought is to use only capital letters and numbers as follows:
A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9
This problem may be difficult to separate from the given type face. The distinctiveness of the characters in the chosen typeface could significantly affect the potential visual ambiguity of any two characters, but I expect that in most modern typefaces the above characters that are equated will have a similar enough appearance to warrant equating them.
I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?
I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key, etc.), the best I could come up with is the following set of 16 characters, which can be used as a replacement for hexadecimal:
0 1 2 3 4 5 6 7 8 9 A B C D E F Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F Replacement
In the replacement set, we consider the following:
All characters used have major distinguishing features that would only be omitted in a truly awful font.
Vowels A E I O U omitted to avoid accidentally spelling words.
Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):
0 O D Q
1 I L J
8 B
5 S
2 Z
By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.
For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:
Y U V
Here Y is used, since it always has the lower vertical section, and a serif in serif fonts
C G
Here C is used, since it seems less likely that a C would be entered as G, than vice versa
X K
Here X is used, since it is more consistent in most fonts
F E
Here F is used, since it is not a vowel
In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).
Note that there are still similar-sounding letters in the replacement set, this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.
Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.
Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
Upper case seems best for the "canonical" form for output, although lower case also looks reasonable, except for "h" and "n", which should still be relatively clear in most fonts:
h m n 3 4 p 6 7 r 9 t w c x y f
Input can of course be case-insensitive.
There are several similar systems for base 32, see http://en.wikipedia.org/wiki/Base32 However these obviously need to introduce more similar-looking characters, in return for an additional 25% more information per character.
Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:
B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
My set of 23 unambiguous characters is:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
I needed a set of unambiguous characters for user input, and I couldn't find anywhere that others have already produced a character set and set of rules that fit my criteria.
My requirements:
No capitals: this is supposed to be used in URIs, and typed by people who might not have a lot of typing experience, for whom even the shift key can slow them down and cause uncertainty. I also want someone to be able to say "all lowercase" so as to reduce uncertainty, so I want to avoid capital letters.
Few or no vowels: an easy way to avoid creating foul language or surprising words is to simply omit most vowels. I think keeping "e" and "y" is ok.
Resolve ambiguity consistently: I'm open to using some ambiguous characters, so long as I only use one character from each group (e.g., out of lowercase s, uppercase S, and five, I might only use five); that way, on the backend, I can just replace any of these ambiguous characters with the one correct character from their group. So, the input string "3Sh" would be replaced with "35h" before I look up its match in my database.
Only needed to create tokens: I don't need to encode information like base64 or base32 do, so the exact number of characters in my set doesn't really matter, besides my wanting it to be as large as possible. It only needs to be useful for producing random UUID-type id tokens.
Strongly prefer non-ambiguity: I think it's much more costly for someone to enter a token and have something go wrong than it is for someone to have to type out a longer token. There's a tradeoff, of course, but I want to strongly prefer non-ambiguity over brevity.
The confusable groups of characters I identified:
A/4
b/6/G
8/B
c/C
f/F
9/g/q
i/I/1/l/7 - just too ambiguous to use; note that european "1" can look a lot like many people's "7"
k/K
o/O/0 - just too ambiguous to use
p/P
s/S/5
v/V
w/W
x/X
y/Y
z/Z/2
Unambiguous characters:
I think this leaves only 9 totally unambiguous lowercase/numeric chars, with no vowels:
d,e,h,j,m,n,r,t,3
Adding back in one character from each of those ambiguous groups (and trying to prefer the character that looks most distinct, while avoiding uppercase), there are 23 characters:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
Analysis:
Using the rule of thumb that a UUID with a numerical equivalent range of N possibilities is sufficient to avoid collisions for sqrt(N) instances:
an 8-digit UUID using this character set should be sufficient to avoid collisions for about 300,000 instances
a 16-digit UUID using this character set should be sufficient to avoid collisions for about 80 billion instances.
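A rough sketch of how this set could be used, with hypothetical helper names. The look-alike table folds each confusable group listed above onto the one character kept from it, and the capacity helper applies the sqrt(N) rule of thumb.

```python
import math
import secrets

ALPHABET = "cdefhjkmnprtvwxy2345689"  # the 23 characters listed above

# Each look-alike maps to the character kept from its group.
LOOKALIKES = str.maketrans({
    "A": "4", "b": "6", "G": "6", "B": "8", "C": "c", "F": "f",
    "g": "9", "q": "9", "K": "k", "P": "p", "s": "5", "S": "5",
    "V": "v", "W": "w", "X": "x", "Y": "y", "z": "2", "Z": "2",
})


def new_token(length=8):
    """Random token drawn uniformly from the alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def normalize(user_input):
    """Fold look-alikes, lowercase the rest, and reject anything still outside the set."""
    cleaned = user_input.translate(LOOKALIKES).lower()
    bad = sorted(set(cleaned) - set(ALPHABET))
    if bad:
        raise ValueError(f"characters not in the token alphabet: {bad}")
    return cleaned


def rough_capacity(length):
    """sqrt(N) rule of thumb, where N = 23**length possible tokens."""
    return math.isqrt(len(ALPHABET) ** length)


print(normalize("3Sh"), rough_capacity(8), rough_capacity(16))
```

The translation has to run before lowercasing, because "B" and "b" belong to different groups (8 and 6).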
Mainly drawing inspiration from this ux thread, mentioned by @rwb,
Several programs use similar things. The list in your post seems to be very similar to those used in these programs, and I think it should be enough for most purposes. You can always add redundancy (error-correction) to "forgive" minor mistakes; this will require you to space out your codes (see Hamming distance), though.
No references as to particular method used in deriving the lists, except trial and error
with humans (which is great for non-ocr: your users are humans)
It may make sense to use character grouping (say, groups of 5) to increase context ("first character in the second of 5 groups")
Ambiguity can be eliminated by using complete nouns (from a dictionary with few look-alikes; word-edit-distance may be useful here) instead of characters. People may confuse "1" with "i", but few will confuse "one" with "ice".
Another option is to make your code into a (fake) word that can be read out loud. A markov model may help you there.
If you have the option to use only capitals, I created this set based on characters which users commonly mistyped, however this wholly depends on the font they read the text in.
Characters to use: A C D E F G H J K L M N P Q R T U V W X Y 3 4 6 7 9
Characters to avoid:
B similar to 8
I similar to 1
O similar to 0
S similar to 5
Z similar to 2
What you seek is an unambiguous, efficient Human-Computer code. What I recommend is to encode the entire data with literal (meaningful) words, nouns in particular.
I have been developing software to do just that, and most efficiently. I call it WCode. Technically it's just Base-1024 encoding, wherein you use words instead of symbols.
Here are the links:
Presentation: https://docs.google.com/presentation/d/1sYiXCWIYAWpKAahrGFZ2p5zJX8uMxPccu-oaGOajrGA/edit
Documentation: https://docs.google.com/folder/d/0B0pxLafSqCjKOWhYSFFGOHd1a2c/edit
Project: https://github.com/San13/WCode (Please wait while I get around uploading...)
This would be a general problem in OCR. Thus, for end-to-end solutions where the OCR encoding is controlled, specialised fonts have been developed to solve the "visual ambiguity" issue you mention.
See: http://en.wikipedia.org/wiki/OCR-A_font
As additional information: you may want to know about Base32 encoding, wherein the symbol for the digit '1' is not used, as it may be confused with the symbol for the letter 'l'.
Unambiguous looking letters for humans are also unambiguous for optical character recognition (OCR). By removing all pairs of letters that are confusing for OCR, one obtains:
!+2345679:BCDEGHKLQSUZadehiopqstu
See https://www.monperrus.net/martin/store-data-paper
It depends how large you want your set to be. For example, just the set {0, 1} will probably work well. Similarly the set of digits only. But probably you want a set that's roughly half the size of the original set of characters.
I have not done this, but here's a suggestion. Pick a font, pick an initial set of characters, and write some code to do the following. Draw each character to fit into an n-by-n square of black and white pixels, for n = 1 through (say) 10. Cut away any all-white rows and columns from the edge, since we're only interested in the black area. That gives you a list of 10 codes for each character. Measure the distance between any two characters by how many of these codes differ. Estimate what distance is acceptable for your application. Then do a brute-force search for a set of characters which are that far apart.
Basically, use a script to simulate squinting at the characters and see which ones you can still tell apart.
Here's some Python I wrote to encode and decode integers using the system of characters described above.
CHARS = '012345689ACEHKMNPRUW'

def base20encode(i):
    """Convert a non-negative integer into a base20 string of unambiguous characters."""
    if not isinstance(i, int):
        raise TypeError('This function must be called on an integer.')
    if i < 0:
        raise ValueError('This function must be called on a non-negative integer.')
    if i == 0:
        return CHARS[0]  # the loop below would otherwise return an empty string
    s = ''
    while i > 0:
        i, remainder = divmod(i, 20)
        s = CHARS[remainder] + s
    return s

def base20decode(s):
    """Convert look-alike characters to their canonical form, then decode the base20 string."""
    if not isinstance(s, str):
        raise TypeError('This function must be called on a string.')
    # X is folded onto K (not the other way round) because K is the character in CHARS.
    s = s.translate(str.maketrans('BGDOQFIJLT7XSVYZ', '8C000E11111K5UU2'))
    i = 0
    for char in s:
        i = i * 20 + CHARS.index(char)
    return i

print(base20decode(base20encode(10)))  # round trip, prints 10
base58:123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

How to generate all possible combinations n-bit strings?

Given a positive integer n, I want to generate all possible n bit combinations in matlab.
For ex : If n=3, then answer should be
000
001
010
011
100
101
110
111
How do I do it ?
I want to actually store them in matrix. I tried
for n=1:2^4
r(n)=dec2bin(n,5);
end;
but that gave error "In an assignment A(:) = B, the number of elements in A and B must be the same."
Just loop over all integers in [0,2^n), and print the number as binary. If you always want to have n digits (e.g. insert leading zeros), this would look like:
for ii=0:2^n-1,
fprintf('%s\n', dec2bin(ii, n));
end
Edit: there are a number of ways to put the results in a matrix. The easiest is to use
x = dec2bin(0:2^n-1);
which will produce a 2^n-by-n matrix of type char. Each row is one of the bit strings.
If you really want to store strings in each row, you can do this:
x = cell(1, 2^n);
for ii=0:2^n-1,
x{ii+1} = dec2bin(ii, n);
end
However, if you're looking for efficient processing, you should remember that integers are already stored in memory in binary! So, the vector:
x = 0 : 2^n-1;
Contains the binary patterns in the most memory efficient and CPU efficient way possible. The only trade-off is that you will not be able to represent patterns with more than 32 or 64 bits using this compact representation.
This is a one-line answer to the question which gives you a double array of all 2^n bit combinations:
bitCombs = dec2bin(0:2^n-1) - '0'
So many ways to do this permutation. If you are looking to implement with an array counter: set an array of counters going from 0 to 1 for each of the three positions (2^0,2^1,2^2). Let the starting number be 000 (stored in an array). Use the counter and increment its 1st place (2^0). The number will be 001. Reset the counter at position (2^0) and increase counter at 2^1 and go on a loop till you complete all the counters.
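The counter idea can be sketched as follows, in Python rather than MATLAB and with a hypothetical function name. Each array slot holds one bit, slot 0 being the 2^0 place, and incrementing means resetting trailing 1s to 0 and setting the first 0 found.

```python
def all_bit_strings(n):
    """List every n-bit string in counting order using an array of 0/1 counters."""
    bits = [0] * n  # bits[0] is the 2^0 place
    result = []
    while True:
        # Print most significant bit first
        result.append("".join(str(b) for b in reversed(bits)))
        # Increment with carry: reset trailing 1s, then set the first 0
        i = 0
        while i < n and bits[i] == 1:
            bits[i] = 0
            i += 1
        if i == n:  # carried past the top bit, so every pattern has been produced
            break
        bits[i] = 1
    return result


for s in all_bit_strings(3):
    print(s)
```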