How to generate a hash of arbitrary length with 32-bit MurmurHash3

I am currently trying to hash a set of strings using MurmurHash3, but a 32-bit hash seems to be too large for me to handle. I want to reduce the number of bits used to generate hashes to around 24. I have already found some questions explaining how to reduce it to 16, 8, 4, or 2 bits using XOR folding, but those are too few bits for my application.
Can somebody help me?

When you have a 32-bit hash, it's something like (with spaces for readability):
1101 0101 0101 0010 1010 0101 1110 1000
To get a 24-bit hash, you want to keep the lower-order 24 bits. The notation for that varies by language, but many languages use "x & 0xFFFFFF" for a bit-wise AND with the hex constant 0xFFFFFF. That effectively does the following (with the AND applied to each vertical column of bits, so 1 AND 1 is 1, and 0 AND 1 is 0):
1101 0101 0101 0010 1010 0101 1110 1000 AND <-- hash value from above
0000 0000 1111 1111 1111 1111 1111 1111 <-- 0xFFFFFF in binary
==========================================
0000 0000 0101 0010 1010 0101 1110 1000
You do waste a little randomness from your hash value though. That doesn't matter much with a pretty decent hash like murmur32, but you can expect slightly fewer collisions if you instead further randomise the low-order bits using the high-order bits you'd otherwise chop off. To do that, right-shift the high-order bits and XOR them into lower-order bits (it doesn't really matter which). Again, a common notation for that is:
((x & 0xFF000000) >> 8) ^ x
...which can be read as: do a bitwise AND to retain only the most significant byte of x, shift that right by 8 bits, then bitwise exclusive-OR it with the original value of x. The result of the above expression then has bit 23 (counting from 0 as the least significant bit) set if and only if one or the other (but not both) of bits 23 and 31 were set in the value of x. Similarly, bit 22 is the XOR of bits 22 and 30, and so on down to bit 16, which is the XOR of bits 16 and 24. Bits 0..15 remain the same as in the original value of x. Finally, mask the result with 0xFFFFFF as above to keep just the low 24 bits.
Yet another approach is to pick a prime number ever-so-slightly lower than 2^24-1, and mod (%) your 32-bit murmur hash value by that, which will mix in the high order bits even more effectively than the XOR above, but you'll obviously only get values up to the prime number - 1, and not all the way to 2^24-1.
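For concreteness, here is a minimal C sketch of the three reductions described above (mask, XOR fold, mod by a prime). It assumes you already have the 32-bit MurmurHash3 value from elsewhere; the prime constant is only an example candidate near 2^24 and should be verified or replaced with a prime of your choosing.

#include <stdint.h>

/* h is a 32-bit MurmurHash3 value obtained elsewhere. */

/* 1. Keep only the low-order 24 bits. */
uint32_t reduce_mask(uint32_t h) {
    return h & 0xFFFFFF;
}

/* 2. Fold the otherwise-discarded high byte into bits 16..23, then mask. */
uint32_t reduce_fold(uint32_t h) {
    return (((h & 0xFF000000u) >> 8) ^ h) & 0xFFFFFF;
}

/* 3. Reduce modulo a prime just below 2^24.  16777213 is used here as an
   assumed example; check its primality or substitute your own prime. */
uint32_t reduce_mod(uint32_t h) {
    return h % 16777213u;
}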

Related

CRC16 (ModBus) - computing algorithm

I am using Modbus RTU, and I'm trying to figure out how to calculate the CRC16.
I don't need a code example. I am simply curious about the mechanism.
I have learned that a basic CRC is a polynomial division of the data word, which is padded with zeros, depending on the length of the polynomial.
The following test example is supposed to check if my basic understanding is correct:
data word: 0100 1011
polynomial: 1001 (x^3+1)
padded by 3 bits because of the highest exponent x^3
calculation: 0100 1011 000 / 1001 -> remainder: 011
Calculation:
01001011000
 1001
 0000011000
      1001
      01010
       1001
       0011
Edit1: So far verified by Mark Adler in previous comments/answers.
Searching for an answer I have seen a lot of different approaches with reversing, dependence on little or big endian, etc., which alter the outcome from the given 011.
Modbus RTU CRC16
Of course I would love to understand how different versions of CRCs work, but my main interest is to simply understand what mechanism is applied here. So far I know:
x^16 + x^15 + x^2 + 1 is the polynomial: 0x18005, or 0b1 1000 0000 0000 0101
initial value is 0xFFFF
example message in hex: 01 10 C0 03 00 01
CRC16 of above message in hex: C9CD
I did calculate this manually like the example above, but I'd rather not write it all out in binary in this question. I presume my transformation into binary is correct. What I don't know is how to incorporate the initial value -- is the data word padded with it instead of zeros? Or do I need to reverse the answer? Something else?
1st attempt: Padding by 16 bits with zeros.
Calculated remainder in binary would be 1111 1111 1001 1011, which is FF9B in hex and incorrect for CRC16/Modbus, but correct for CRC16/BUYPASS
2nd attempt: Padding by 16 bits with ones, due to initial value.
Calculated remainder in binary would be 0000 0000 0110 0100 which is 0064 in hex and incorrect.
It would be great if someone could explain, or clarify my assumptions. I honestly did spend many hours searching for an answer, but every explanation is based on code examples in C/C++ or other languages, which I don't understand. Thanks in advance.
EDIT1: According to this site, "1st attempt" corresponds to another CRC16 method with the same polynomial but a different initial value (0x0000), which tells me the calculation should be correct.
How do I incorporate the initial value?
EDIT2: Mark Adler's answer does the trick. However, now that I can compute CRC16/Modbus, there are some questions left for clarification. Answers are not needed, but appreciated.
A) The order of computation would be:
1st: apply RefIn to the complete input (including the padded bits)
2nd: XOR the InitValue with the first 16 bits (for CRC16)
3rd: apply RefOut to the complete output/remainder (at most 16 bits for CRC16)
B) Referring to RefIn and RefOut: does it always reflect 8 bits at a time for the input and all bits for the output, regardless of whether I use CRC8, CRC16 or CRC32?
C) What do the 3rd (Check) and 8th (XorOut) columns on the website I am referring to mean? The latter seems rather easy: I am guessing it is applied by XORing the value after RefOut, just like the InitValue?
Let's take this a step at a time. You now know how to correctly calculate CRC-16/BUYPASS, so we'll start from there.
Let's take a look at CRC-16/CCITT-FALSE. That one has an initial value that is not zero, but still has RefIn and RefOut as false, like CRC-16/BUYPASS. To compute the CRC-16/CCITT-FALSE on your data, you exclusive-or the first 16 bits of your data with the Init value of 0xffff. That gives fe ef C0 03 00 01. Now do what you know on that, but with the polynomial 0x11021. You will get what is in the table, 0xb53f.
Now you know how to apply Init. The next step is dealing with RefIn and RefOut being true. We'll use CRC-16/ARC as an example. RefIn means that we reflect the bits in each byte of input. RefOut means that we reflect the bits of the remainder. The input message is then: 80 08 03 c0 00 80. Dividing by the polynomial 0x18005 we get 0xb34b. Now we reflect all of those bits (not in each byte, but all 16 bits), and we get 0xd2cd. That is what you see as the result in the table.
We now have what we need to compute CRC-16/MODBUS, which has both a non-zero Init value (0xffff) and RefIn and RefOut as true. We start with the message with the bits in each byte reflected and the first 16 bits inverted. That is 7f f7 03 c0 00 80. Divide by 0x18005 and you get the remainder 0xb393. Reflect those bits and we get 0xc9cd, the expected result.
The exclusive-or of Init is applied after the reflection, which you can verify using CRC-16/RIELLO in that table.
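To make that walkthrough concrete, here is a small C sketch (an illustration, not code from the original answer) that follows the described steps literally for CRC-16/MODBUS: reflect each input byte, exclusive-or the first 16 bits with the Init value 0xffff, divide MSB-first by the polynomial 0x18005, and finally reflect the 16-bit remainder.

#include <stdint.h>
#include <stddef.h>

/* Reflect (reverse) the low n bits of v. */
static uint32_t reflect(uint32_t v, int n) {
    uint32_t r = 0;
    for (int i = 0; i < n; i++)
        if (v & (1u << i))
            r |= 1u << (n - 1 - i);
    return r;
}

/* CRC-16/MODBUS computed literally as described above.  Assumes len >= 2. */
uint16_t crc16_modbus_literal(const uint8_t *msg, size_t len) {
    uint16_t rem = 0;                                /* running remainder    */
    for (size_t i = 0; i < len; i++) {
        uint8_t b = (uint8_t)reflect(msg[i], 8);     /* RefIn: reflect byte  */
        if (i < 2)
            b ^= 0xFF;                               /* Init: fold 0xffff
                                                        into the first 16 bits */
        rem ^= (uint16_t)b << 8;                     /* feed byte, MSB first */
        for (int k = 0; k < 8; k++)                  /* divide by 0x18005
                                                        (0x8005 = poly without
                                                        the implicit x^16)    */
            rem = (rem & 0x8000) ? (uint16_t)((rem << 1) ^ 0x8005)
                                 : (uint16_t)(rem << 1);
    }
    return (uint16_t)reflect(rem, 16);               /* RefOut */
}

/* crc16_modbus_literal((const uint8_t[]){0x01,0x10,0xC0,0x03,0x00,0x01}, 6)
   returns 0xC9CD, matching the result above. */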
Answers for added questions:
A) RefIn has nothing to do with the padded bits. You reflect the input bytes. However in a real calculation, you reflect the polynomial instead, which takes care of both reflections.
B) Yes.
C) Yes, XorOut is what you exclusive-or the final result with. Check is the CRC of the nine bytes "123456789" in ASCII.
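As a practical illustration of point A, here is a common bit-by-bit C sketch (again an illustration, not the answer's own code) that reflects the polynomial instead of the data: 0x8005 reflected is 0xA001, the register is run LSB-first, Init is 0xffff, and no final reflection or XorOut is needed.

#include <stdint.h>
#include <stddef.h>

/* CRC-16/MODBUS with the reflected polynomial 0xA001 (0x8005 bit-reversed). */
uint16_t crc16_modbus(const uint8_t *msg, size_t len) {
    uint16_t crc = 0xFFFF;                     /* Init value                 */
    for (size_t i = 0; i < len; i++) {
        crc ^= msg[i];                         /* feed next byte, LSB first  */
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                            : (uint16_t)(crc >> 1);
    }
    return crc;                                /* 0xC9CD for 01 10 C0 03 00 01 */
}

/* The Check column: crc16_modbus((const uint8_t *)"123456789", 9) == 0x4B37. */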

Cache memory logic

A computer has 1 MB of RAM and a word size of 8 bits. It has a cache memory of 16 blocks with a block size of 32 bits. Show how the main memory address
1000 1111 1010 0101 1101 will be mapped to cache address, if
i) Direct cache mapping is used
ii) Associative cache mapping is used
iii) Two-way set associative cache mapping is used
Please enlighten me on how to solve this problem. I have looked all over and there is no detailed explanation of this.
32 bits is 4 bytes; you need 2 bits to address these 4 bytes (2^2 = 4), so you split off the 2 least significant bits of the address.
1000 1111 1010 0101 11-01
Direct mapped means a block can go into only one place in the cache; there are 16 places in the cache, so we must peel off the next 4 least significant bits (2^4 = 16), getting
1000 1111 1010 01-01 11-01
so 0111 (=7) is the line that gets filled.
If (Fully) Associative cache mapping is used it can go in any of the 16 positions.
Two-way set associative cache mapping is like direct mapped, but we split the cache in half (8 sets = 2^3), giving 2 possible positions where it can be stored.
1000 1111 1010 010-1 11-01
so 111 is the index and any of the 2 possible positions can be used.
Read all about caches here.
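A small C sketch of the bit-splitting above (my own illustration; the constant 0x8FA5D is simply the 20-bit address 1000 1111 1010 0101 1101 written in hex):

#include <stdio.h>

int main(void) {
    unsigned addr = 0x8FA5D;                  /* 1000 1111 1010 0101 1101 */

    unsigned offset  = addr & 0x3;            /* 2 bits: byte within the 4-byte block */
    unsigned dm_line = (addr >> 2) & 0xF;     /* 4 bits: direct-mapped cache line     */
    unsigned set_idx = (addr >> 2) & 0x7;     /* 3 bits: set index for 2-way mapping  */

    printf("offset = %u\n", offset);                 /* 1 (binary 01)    */
    printf("direct-mapped line = %u\n", dm_line);    /* 7 (binary 0111)  */
    printf("2-way set index = %u\n", set_idx);       /* 7 (binary 111)   */
    return 0;
}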

How are bits arranged in a reduced-range variable in Ada?

Let's say I've created a type in Ada:
type Coord_Type is range -32 .. 31;
What can I expect the bits to look like in memory, or specifically when transmitting this value to another system?
I can think of two options.
One is that the full (default integer?) space is used for all variables of "Coord_Type", but only the values within the range are possible. If I assume 2's complement, then the values 25 and -25 would be possible, but not 50 or -50:
0000 0000 0001 1001 ( 25)
1111 1111 1110 0111 (-25)
0000 0000 0011 0010 ( 50) X Not allowed
1111 1111 1100 1110 (-50) X Not allowed
The other option is that the space is compressed to only what is needed. (I chose a byte, but maybe even only 6 bits?) So with the above values, the bits might be arranged as such:
0000 0000 0001 1001 ( 25)
0000 0000 1110 0111 (-25)
0000 0000 0011 0010 ( 50) X Not allowed
0000 0000 1100 1110 (-50) X Not allowed
Essentially, does Ada further influence the storage of values beyond limiting what range is allowed in a variable? Are these details, endianness, and 2's complement even controlled by Ada?
When you declare the type like that, you leave it up to the compiler to choose the optimal layout for each architecture. You might even get binary-coded-decimal (BCD) instead of two's complement on some architectures.

Virtual Memory Address in Binary form

Please help me out, I'm studying operating systems. Under virtual memory I found this:
A user process generates a virtual address 11123456, and it is said the virtual address in binary form is 0001 0001 0001 0010 0011 0100 0101 0110. How was that converted? Because when I convert 11123456 to binary I get 0001 0101 0011 0111 0110 000 0000. It is said the virtual memory is implemented by paging, and the page size is 4096 bytes.
You assume that 11123456 is a decimal number, while according to the result it's hexadecimal. In general, decimal numbers are rarely used in CS; representations in powers of 2 are much more common and convenient. The bases mostly used today are 16 (hexadecimal) and 2 (binary).
Converting into binary may help to identify the page number and offset so that you can calculate the physical address corresponding to the logical address. You should be able to understand how to do this if you are a CS student.
For this particular problem, i.e. paging, you can convert from a logical to a physical address without converting into binary, using the modulo (%) and divide (/) operators. However, doing it in binary is the original way.
In your question, the value 11123456 should be a hexadecimal number, and it should be written as 0x11123456 to distinguish it from decimal numbers. And from the binary format "0001 0001 0001 0010 0011 0100 0101 0110", we can infer that the offset of the logical address is "0100 0101 0110" (the 12 rightmost bits, 0x456 in hexadecimal, or 1110 in decimal) and the page number is "0001 0001 0001 0010 0011" (the remaining bits, 0x11123 in hexadecimal, or 69923 in decimal).
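A small C sketch of that split (my own illustration), using the 4096-byte page size from the question:

#include <stdio.h>

int main(void) {
    unsigned long vaddr  = 0x11123456UL;       /* hexadecimal, not decimal */
    unsigned long page   = vaddr / 4096;       /* same as vaddr >> 12      */
    unsigned long offset = vaddr % 4096;       /* same as vaddr & 0xFFF    */

    printf("page = 0x%lX (%lu)\n", page, page);        /* 0x11123 (69923) */
    printf("offset = 0x%lX (%lu)\n", offset, offset);  /* 0x456 (1110)    */
    return 0;
}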

Why is a SHA-1 Hash 40 characters long if it is only 160 bit?

The title of the question says it all. I have been researching SHA-1 and most places I see it being 40 hex characters long, which to me is 640 bits. Could it not be represented just as well with only 10 hex characters (160 bits = 20 bytes)? One hex character can represent 2 bytes, right? Why is it twice as long as it needs to be? What am I missing in my understanding?
And couldn't an SHA-1 be even just 5 or less characters if using Base32 or Base36 ?
One hex character can only represent 16 different values, i.e. 4 bits. (16 = 2^4)
40 × 4 = 160.
And no, you need much more than 5 characters in base-36.
There are a total of 2^160 different SHA-1 hashes.
2^160 = 16^40, so this is another reason why we need 40 hex digits.
But 2^160 = 36^(160 × log36(2)) = 36^30.9482..., so you still need 31 characters using base-36.
I think the OP's confusion comes from the fact that a string representing a SHA-1 hash takes 40 bytes (at least if you are using ASCII), which equals 320 bits (not 640 bits).
The reason is that the hash is in binary and the hex string is just an encoding of that. So if you were to use a more efficient encoding (or no encoding at all), you could take only 160 bits of space (20 bytes), but the problem with that is it won't be binary safe.
You could use base64 though, in which case you'd need about 27-28 bytes (or characters) instead of 40 (see this page).
There are two hex characters per 8-bit-byte, not two bytes per hex character.
If you are working with 8-bit bytes (as in the SHA-1 definition), then a hex character encodes a single high or low 4-bit nibble within a byte. So it takes two such characters for a full byte.
My answer only differs from the previous ones in my theory as to the EXACT origin of the OP's confusion, and in the baby steps I provide for elucidation.
A character takes up different numbers of bytes depending on the encoding used (see here). There are a few contexts these days when we use 2 bytes per character, for example when programming in Java (here's why). Thus 40 Java characters would equal 80 bytes = 640 bits, the OP's calculation, and 10 Java characters would indeed encapsulate the right amount of information for a SHA-1 hash.
Unlike the thousands of possible Java characters, however, there are only 16 different hex characters, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. But these are not the same as Java characters, and take up far less space than the encodings of the Java characters 0 to 9 and A to F. They are symbols signifying all the possible values represented by just 4 bits:
0 0000    4 0100    8 1000    C 1100
1 0001    5 0101    9 1001    D 1101
2 0010    6 0110    A 1010    E 1110
3 0011    7 0111    B 1011    F 1111
Thus each hex character is only half a byte, and 40 hex characters gives us 20 bytes = 160 bits - the length of a SHA-1 hash.
2 hex characters make up a range from 0-255, i.e. 0x00 == 0 and 0xFF == 255. So 2 hex characters encode 8 bits, which makes 160 bits for the 40 characters of your SHA digest.
SHA-1 is 160 bits
That translates to 20 bytes = 40 hex characters (2 hex characters per byte)
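To see the 20-byte/40-character relationship in code, here is a tiny C sketch (my own illustration). The digest bytes below are the well-known SHA-1 of the empty string; the loop simply hex-encodes them.

#include <stdio.h>

int main(void) {
    /* SHA-1("") : 20 bytes = 160 bits */
    unsigned char digest[20] = {
        0xda, 0x39, 0xa3, 0xee, 0x5e, 0x6b, 0x4b, 0x0d, 0x32, 0x55,
        0xbf, 0xef, 0x95, 0x60, 0x18, 0x90, 0xaf, 0xd8, 0x07, 0x09
    };
    char hex[41];                              /* 2 characters per byte + NUL */

    for (int i = 0; i < 20; i++)
        sprintf(&hex[2 * i], "%02x", digest[i]);

    printf("%s\n", hex);   /* prints 40 hex characters:
                              da39a3ee5e6b4b0d3255bfef95601890afd80709 */
    return 0;
}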