Unambiguous binary encoding scheme for the alphabet

An old British Informatics Olympiad question (3c) asks what the smallest unambiguous encoding scheme for the alphabet (using only two symbols - hence binary) is. As far as I can see, the answer is 130: since 2^4 = 16 < 26, 5 bits are required to store each letter, and with 26 letters the encoding scheme is 5 * 26 = 130 bits long. However, the mark scheme states that 124 bits can be used. What is the encoding scheme that is that long?

I think this works:
a - 0010
b - 0011
c - 0100
d - 0101
e - 0110
f - 0111
g - 10000
h - 10001
i - 10010
j - 10011
k - 10100
l - 10101
m - 10110
n - 10111
o - 11000
p - 11001
q - 11010
r - 11011
s - 11100
t - 11101
u - 11110
v - 11111
w - 00000
x - 00001
y - 00010
z - 00011
It is unambiguous: if a codeword starts with 1, it has length 5; if it starts with 000, it also has length 5; otherwise (it starts with 001, 010, or 011) it has length 4.
I got the idea by starting with a through h being length 4, using 0 as the first symbol. However, a scheme like that is two codewords short (if length is determined entirely by the first symbol), so I looked for a way to reduce the number of four-bit codes by two... and noticed that 0000 and 0001 were the only two containing a triple 0. Extending those two by one more bit gives four five-bit codewords, and the rest is an unambiguous encoding scheme :)
6 * 4 + 20 * 5 = 124
or alternatively
4 + 16 + 6 = 26 (four five-bit codes starting 000, sixteen starting 1, six four-bit codes)
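A quick sketch (my own, in Python, not part of the original post) that builds this table programmatically and checks that greedy decoding is unambiguous:

```python
# a-f get the 4-bit codes 0010..0111; g-v get 10000..11111; w-z get 00000..00011.
codes = {}
for i, ch in enumerate("abcdef"):            # 0010 .. 0111
    codes[ch] = format(i + 2, "04b")
for i, ch in enumerate("ghijklmnopqrstuv"):  # 10000 .. 11111
    codes[ch] = format(i + 16, "05b")
for i, ch in enumerate("wxyz"):              # 00000 .. 00011
    codes[ch] = format(i, "05b")

inverse = {v: k for k, v in codes.items()}

def decode(bits):
    out, i = "", 0
    while i < len(bits):
        # length 5 if the codeword starts with 1 or with 000, else length 4
        n = 5 if bits[i] == "1" or bits[i:i + 3] == "000" else 4
        out += inverse[bits[i:i + n]]
        i += n
    return out

assert sum(map(len, codes.values())) == 124   # 6*4 + 20*5
assert decode("".join(codes[c] for c in "wave")) == "wave"
```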

The trick here is not to use a fixed-length encoding (as you have pointed out, log2(26) is somewhere between 4 and 5, so a 5-bit encoding scheme has unused blocks), but to vary the length of our data words so we get an optimized number of bits for each letter.
When creating a table of the 32 combinations, we can assign the letters A-Z to each value, with A starting at 00000, B = 00001 and so on. Z will be 11001 – the rest (11010…11111) will be unused.
Now it gets a bit trickier. We have six unused combinations at the end, but we cannot simply drop them, as there is no such thing as "half a bit of information". Instead, we need to move six combinations so that we can drop the last bit of each of them. Example:
10100 = U, 10101 = V
becomes
10100 = U, 10110 = V
The other combinations are moved accordingly so the last bit of each of the last six letters is a "0". Then this bit can be dropped, so we end with these letters:
00000 = A, 00001 = B, …, 10011 = T, 1010 = U, 1011 = V, 1100 = W, 1101 = X, 1110 = Y, 1111 = Z
Important: While this scheme is prefix-free (i.e. no codeword is the start of another, longer codeword) and thus unambiguous, it is not self-synchronizing, so we cannot jump into the middle of a stream of encoded characters and be guaranteed correct output. That would require a synchronization "character" that is not contained in any other letter - which is not possible in a zero-redundancy scheme like this one.
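The final table can be generated and sanity-checked with a short sketch (mine, not from the original answer):

```python
# A..T keep five bits (00000..10011); U..Z shrink to four bits (1010..1111),
# i.e. the moved codewords 10100..11110 with their trailing 0 dropped.
codes = {chr(ord("A") + i): format(i, "05b") for i in range(20)}           # A..T
codes.update({chr(ord("U") + i): format(10 + i, "04b") for i in range(6)}) # U..Z

words = list(codes.values())
# prefix-free: no codeword is the start of another, longer codeword
assert not any(a != b and b.startswith(a) for a in words for b in words)
assert sum(map(len, words)) == 124
```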


Hashing functions and Universal Hashing Family

I need to determine whether the following Hash Functions Set is universal or not:
Let U be the set of keys - {000, 001, 002, 003, ..., 999} - all the numbers between 0 and 999, padded with 0 at the beginning where needed. Let n = 10, and let a be an integer with 1 ≤ a ≤ 9. We denote by ha(x) the rightmost digit of the number a*x.
For example, h2(123) = 6, because, 2 * 123 = 246.
We also denote H = {h1, h2, h3, ... ,h9} as our set of hash functions.
Is H universal? Prove it.
I know I need to calculate the probability of a collision between 2 different keys and check whether it is at most 1/n (which is 1/10). I tried to separate into cases by whether a is odd or even - when a is even the last digit of a*x must be 0/2/4/6/8, otherwise it can be anything - but that didn't get me far, and I'm stuck.
Would be very glad for some help here.
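Not a proof, but a quick brute-force experiment (my own sketch) can show whether the 1/10 collision bound holds for every pair of keys:

```python
def h(a, x):
    return (a * x) % 10  # rightmost digit of a*x

# Universality requires: for every pair x != y,
# |{a : h(a, x) == h(a, y)}| / 9 <= 1/10.
worst = 0
for x in range(100):           # a subsample of U suffices to find bad pairs
    for y in range(x + 1, 100):
        collisions = sum(h(a, x) == h(a, y) for a in range(1, 10))
        worst = max(worst, collisions)
print(worst)  # 9 -- e.g. x=0 and y=10 collide under every a
```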

unknown non-binary data encoding - any hints?

I'm trying to decode data sent via RF by a weather station.
Unfortunately, the data isn't represented in the standard binary way (0000, 0001, 0010, 0011, ...). What I've found is the following scheme:
value representation
0 => 0xff = 0b11111111
1 => 0x00 = 0b00000000
2 => 0x01 = 0b00000001
3 => 0xfe = 0b11111110
4 => 0x03 = 0b00000011
5 => 0xfc = 0b11111100
6 => 0xfd = 0b11111101
7 => 0x02 = 0b00000010
...
Or broken down to the bits:
value:  0       8       16      24
        |       |       |       |
Bit 0:  1010101010101010101010101010 ...
Bit 1:  1001100110011001100110011001
Bit 2:  1001011010010110100101101001
Bit 3:  1001011001101001100101100110
Bit 4:  1001011001101001011010011001
Bit 5:  1001011001101001011010011001
Bit 6:  1001011001101001011010011001
Bit 7:  1001011001101001011010011001
Each bit seems to follow a certain pattern of mirroring and inversion of the preceding, e.g. bit 3 = 10 01 0110 01101001
What is this kind of encoding called, and how can it easily be converted to standard binary form?
It looks like the LSB pattern is periodic with period 2 (10 repeated), the next bit is periodic with period 4 (1001 repeated), and presumably the bit before that has period 8 (10010110 repeated).
This is somewhat similar to the normal representation, of course, except that usually the repeating patterns are 01, 0011, 00001111 etcetera.
It seems the pattern 1001 is created by copying 10 and inverting the second copy. Similarly, the pattern 10010110 is created by copying and inverting 1001. Hence, the next pattern, of period 16, would be 1001011001101001.
Now, how are these patterns related?
For the lowest bit, 10 repeated is 01 repeated XOR (11). Simple.
For the next bit, 1001 repeated is 0011 XOR (1010) repeated - and note that the LSB pattern was 10 repeated.
After that, we get 10010110 repeated which is 00001111 XOR (10011001) repeated. See the pattern?
So: you need to XOR each bit with the bit to its right, starting from the MSB (the LSB, which has no right neighbour, is XORed with a constant 1, per the masks above).
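A quick sketch (mine) of that rule in Python: shifting left by one lines each stored bit up with the bit to its right, and OR-ing in a 1 supplies the missing right neighbour of the LSB (which is why value 0 encodes as all ones):

```python
# Decode: XOR each stored bit with the bit to its right; the LSB's
# "right neighbour" is an implicit constant 1.
def decode(rep):
    return (rep ^ ((rep << 1) | 1)) & 0xFF

# the table from the question
table = {0xFF: 0, 0x00: 1, 0x01: 2, 0xFE: 3,
         0x03: 4, 0xFC: 5, 0xFD: 6, 0x02: 7}
assert all(decode(rep) == value for rep, value in table.items())
```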

Borrow during subtracting operation (sbc asm instruction) on 6502?

When does the borrow happen (i.e. when is the carry flag cleared) during a subtract operation (the SBC instruction) on the 6502 used by the NES? Is it every time the result is negative (-1 to -128)?
Many thanks!
STeN
On a 6502 SBC n is exactly identical to ADC (n EOR $FF) — it's one's complement. So carry is clear when A + (operand ^ 0xff) + existing carry is less than 256.
EDIT: so, if carry is set then the subtraction occurs without borrow. If carry is clear then subtraction occurs with borrow. Therefore if carry is set after the subtraction then there was no borrow. If carry is clear then there was borrow.
If you want to test whether a result is negative, check the sign bit implicitly via a BMI or BPL.
It's a bit more complicated than that if in decimal mode on a generic 6502 but the NES variant doesn't have decimal mode so ignore anything you read about that.
To clarify re: the comments below; if you're treating numbers as signed then 127 is +127, 128 is -128, etc. Normal two's complement. Nothing special. E.g.
LDA #-63 ; i.e. 1100 0001
SEC
SBC #65 ; i.e. 0100 0001
; result in accumulator is now -128, i.e. 1000 0000,
; and carry remains set because there was no borrow
BPL somewhere ; wouldn't jump, because -128 is negative
BMI somewhereElse ; would jump, because -128 is negative
The following is exactly equivalent in terms of inner workings:
LDA #-63 ; i.e. 1100 0001
SEC ; ... everything the same up until here ...
ADC #190 ; i.e. 1011 1110 (the complement of 0100 0001)
; result = 1100 0001 + 1011 1110 + 1 = [1] 0111 1111 + 1 = [1] 1000 0000
; ^
; |
; carry
; = -128
So, as above, defining "the result", as per the 6502 manual and ordinary programmatic usage, to be the thing sitting in the accumulator, you can test whether it is positive or negative as stated above, e.g.
SBC $23
BMI resultWasNegative
resultWasPositive: ...
If you're interested in whether the complete result would have been negative (i.e. had it fitted into the accumulator) then you can also check the overflow flag. If overflow is set then that means that whatever is in the accumulator has the wrong sign because of the 8-bit limit. So you can do the equivalent of an exclusive OR between overflow and sign:
SBC $23
BVS signIsTheOpposite
BMI resultWasNegative
JMP resultWasPositive
signIsTheOpposite:
BPL resultWasNegative
JMP resultWasPositive
Tommy's answer is correct, but I have a simpler way of looking at it.
Operations in the 6502's ALU are all 8 bit, so you can think of a subtraction like this (for $65 - $64):
 01100101
-01100100
=========
 00000001
What I do is imagine the subtraction is a 9 bit (unsigned) operation with the 9th bit of the accumulator set to 1, so $65 - $64 would look like this:
1 01100101
- 01100100
==========
1 00000001
Whereas $64 - $65 would look like this
1 01100100
- 01100101
==========
0 11111111
The new carry bit is the imaginary 9th bit of the result.
Essentially, the carry ends up clear when the operand, interpreted as an unsigned number, is greater than the accumulator, interpreted as an unsigned number. Or, to be pedantic, the carry is cleared when
A < operand + 1 - oldcarry
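A small simulation (my own sketch, in Python rather than asm) of SBC as "ADC of the one's complement" makes the carry/borrow rule easy to check:

```python
# Binary-mode 6502 SBC modelled as A + (operand EOR $FF) + carry-in.
# Returns (new accumulator, new carry flag).
def sbc(a, operand, carry):
    total = a + (operand ^ 0xFF) + carry
    return total & 0xFF, 1 if total >= 256 else 0

assert sbc(0x65, 0x64, 1) == (0x01, 1)  # no borrow: operand <= A
assert sbc(0x64, 0x65, 1) == (0xFF, 0)  # borrow: operand > A
assert sbc(0x10, 0xF0, 1) == (0x20, 0)  # borrow, yet the result is positive
```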
Nope, the result may just as well be positive.
Example:
lda #$10
sec
sbc #$f0
Carry will be clear after that, and the accumulator will be $20.
To test for positive/negative values after subtraction, use the N(egative) flag of the status register and the branches that evaluate it (BMI/BPL).

Transforming ciphertext from digital format to alphabetic format

Consider a message "STOP" which we are to encrypt using the RSA algorithm. The values given are p = 43, q = 59, n = pq, e = 13. First I transformed "STOP" into blocks of four digits: 1819 (S = 18, T = 19) and 1415 (O = 14, P = 15), the letters being numbered from 00 to 25.
Finally, after the calculation, I got 20812182 as the encrypted message (combining 2081 and 2182). Is there any way to transform this numeric ciphertext into alphabetic form?
If we take two digits at a time, then 20 = U and 21 = V, but what letters would 81 and 82 be? In other words, what is the alphabetic ciphertext for the plaintext "STOP" in the above case?
RSA works with numbers, not binary data or letters. You can of course convert one to another; this is e.g. what you did when you wrote 20812182. A number with that value can have an endless number of other representations.
Now, creating an alphabetical representation of minimum size is pretty tricky. Basically you repeatedly divide by 26 (i.e. convert to base 26), which is not that easy to implement. Instead, you can take a subset of your alphabet and use it to represent the digits of your number.
To do this, keep your original decimal representation and replace 0 with A, 1 with B, ... and 9 with J. This results in CAIBCBIC for your ciphertext.
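A short sketch (mine, not from the original answer) of both conversions in Python; to_base26 is the minimum-size variant the answer calls tricky:

```python
# Digit substitution: map decimal digit d to the letter chr(ord('A') + d).
def digits_to_letters(s):
    return "".join(chr(ord("A") + int(d)) for d in s)

# Minimum-size alphabetic form: repeated division by 26 (base-26 conversion).
def to_base26(num):
    out = ""
    while num:
        out = chr(ord("A") + num % 26) + out
        num //= 26
    return out or "A"

assert digits_to_letters("20812182") == "CAIBCBIC"
assert to_base26(20812182) == "BTODGO"  # 6 letters instead of 8
```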
Note that plaintext and ciphertext are used as names for the input and output of cryptographic ciphers. Both names seem to indicate some kind of human readable text - and maybe they once did - but in cryptography they can be thought of as any kind of data.

how to create a unique integer from 3 different integers (1 Oracle Long, 1 Date field, 1 Short)

The thing is that the 1st number is already an Oracle LONG, the second one a Date (SQL DATE, no extra timestamp info), and the last one a Short value in the range 1000-100'000.
How can I create a sort of hash value that will be unique for each combination, optimally?
String concatenation with conversion to long afterwards is not what I want, because, for example:
Day Month
12 1 --> 121
1 12 --> 121
When you have a few numeric values and need a single "unique" (that is, statistically improbable to duplicate) value out of them, you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are either well-chosen numbers (e.g. if you know a is always in the 1-31 range, you can use P1 = 32) or, when you know nothing in particular about the allowable ranges of a, b and c, big prime numbers - these have the least chance of generating values that collide.
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
For example, Java implementation for .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
h = h * 31 + a[i];
Personally, though, I would have chosen a prime bigger than 31, as values inside a String can easily collide, since a delta of 31 places can be quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2112
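The general formula above can be sketched as follows (my own example; the primes are arbitrary picks, not values prescribed by the answer):

```python
# Combine three fields into one value via h = (a*P1 + b)*P2 + c.
P1, P2 = 1_000_003, 998_244_353  # example primes, chosen for illustration

def combine(a, b, c):
    return (a * P1 + b) * P2 + c

# unlike naive string concatenation, swapping fields changes the result:
assert combine(12, 1, 0) != combine(1, 12, 0)
```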
Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012
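The padding rule can be sketched like this (mine; the widths match the example ranges above, 0-10000 and 0-100):

```python
# Zero-pad each field to the maximum width its range requires, then concatenate.
def key(first, second):
    return f"{first:05d}{second:03d}"

assert key(12, 1) == "00012001"
assert key(1, 12) == "00001012"   # no longer collides with key(12, 1)
```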
In python, you can use this:
#pip install pairing
import pairing as pf
n = [12,6,20,19]
print(n)
key = pf.pair(pf.pair(n[0],n[1]),
pf.pair(n[2], n[3]))
print(key)
m = [pf.depair(pf.depair(key)[0]),
pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]