Hash collision with 2 strings in Python

In the MIT OpenCourseWare YouTube lecture on hashing, the professor gives an example of two strings causing a hash collision in Python:
>>> hash('\0B')
64
>>> hash('\0\0C')
64
Why does this happen?
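A hedged sketch of the usual explanation: before hash randomization, CPython 2's string hash seeded an accumulator with the first character shifted left by 7 bits, then for each character multiplied the accumulator by 1000003 and XORed the character in, and finally XORed in the string's length. Both strings start with '\0', so the seed and every multiplication stay zero, and the hash collapses to last_char ^ length: ord('B') ^ 2 = 66 ^ 2 = 64 and ord('C') ^ 3 = 67 ^ 3 = 64. A minimal Python re-implementation of that algorithm (illustrative, not the exact C code from Objects/stringobject.c):

def legacy_str_hash(s):
    # Sketch of the pre-randomization CPython 2 string hash.
    if not s:
        return 0
    x = ord(s[0]) << 7
    for ch in s:
        # Mask emulates C long wraparound on a 64-bit build;
        # it never triggers here because x stays small.
        x = ((1000003 * x) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    x ^= len(s)
    return x

print(legacy_str_hash('\0B'))    # 64
print(legacy_str_hash('\0\0C'))  # 64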

Related

Which hashing algorithm generates alphanumeric output?

I am looking for a hashing algorithm that generates alphanumeric output. I did a few tests with MD5, SHA3, etc., and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since the output is in hex format, the possible values are [0-9][a-f].
I am trying to get the maximum possible range of values for the 1st character: [0-9][a-z][A-Z].
Any ideas would be appreciated. Thanks in advance.
Where MD5 computes a 128-bit hash and SHA-256 a 256-bit hash, the output they provide is nothing more than a 128-bit or 256-bit binary number, in short, a lot of zeros and ones. To get a more human-friendly representation of binary-coded values, software developers and system designers use hexadecimal numbers, which is a base-16 representation. For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into base 32 if you want, which is represented using the characters A-Z2-7. Or you could use base 64, which needs the characters A-Za-z0-9+/. In the end, it is just a representation.
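To make that concrete, a small Python sketch (HelloWorld is the input from the question; any digest works the same way):

import hashlib, base64

digest = hashlib.sha3_256(b"HelloWorld").digest()   # 32 raw bytes

print(digest.hex())                        # base 16: [0-9a-f], 64 chars
print(base64.b32encode(digest).decode())   # base 32: [A-Z2-7], 56 chars with padding
print(base64.b64encode(digest).decode())   # base 64: [A-Za-z0-9+/], 44 chars with padding

The same 256-bit number comes out as 64, 56, or 44 characters; only the representation changes.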
There is, however, some practical use to base 16, i.e. hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented hexadecimally, since 2^8 = 2^4 × 2^4 = 16 × 16: each byte is exactly two hex digits. By contrast, 2^8 = 2^5 × 2^3 = 32 × 8, so in base 32 a byte is not cleanly represented; you already need 5 bytes (40 bits) to get a clean base-32 representation of exactly 8 characters. That is not comfortable to deal with on a daily basis.

Decoding multi-length opcodes (SPU ISA)

I have produced a dump of 32-bit instructions in hex from an assembler I implemented. A subset of the instruction dump is shown below:
The opcodes for the instructions are of lengths 4, 7, 8, 9, and 11 bits. They are always the first bits in the instruction. I'm having trouble understanding how I would decode the instructions when the opcodes are of different lengths.
For example: when I read a single instruction, how do I know how many bits I should read for the opcode?
Here is an image of the instruction formats:
Thank you
I figured it out.
I read the maximum number of opcode bits (11) for all instructions, and ignore the bits that don't make sense (i.e., the combinations that wouldn't result in a possible opcode).
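A minimal Python sketch of that approach; the opcode values and mnemonics below are made up for illustration (a real decoder would take them from the SPU ISA manual), and it assumes the opcode assignments are prefix-free, so at most one table entry can match:

OPCODE_TABLE = {
    # (opcode value, opcode length in bits) -> mnemonic; hypothetical entries
    (0b01000000100, 11): "example_op_11",
    (0b0111011, 7):      "example_op_7",
    (0b0100, 4):         "example_op_4",
}

def decode(instr):
    # Try each opcode length longest-first against the top bits
    # of the 32-bit instruction word.
    for length in (11, 9, 8, 7, 4):
        prefix = instr >> (32 - length)
        mnemonic = OPCODE_TABLE.get((prefix, length))
        if mnemonic is not None:
            return mnemonic
    raise ValueError("unknown instruction 0x%08x" % instr)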

When is htonl(x) != ntohl(x)? (Or when is converting to and from network byte order not equivalent on the same machine?)

Regarding htonl and ntohl: when would either of these two lines of code evaluate to false?
htonl(x) == ntohl(x);
htonl(ntohl(x)) == htonl(htonl(x));
In other words, when are these two operations not equivalent on the same machine? The only scenario I can think of is a machine that does not use 2's complement for representing integers.
Is the reason largely historical, for coding clarity, or for something else?
Do any modern architectures or environments exist today where converting to and from network byte order on the same machine is not the same code in either direction?
I wrote a TCP/IP stack for a UNIVAC 1100 series mainframe many years ago. This was a 36 bit, word addressable computer architecture with 1's complement arithmetic.
When this machine did communications I/O, 8 bit bytes arriving from the outside world would get put into the lower 8 bits of each 9 bit quarter-word. So on this system, ntohl() would squeeze 8 bits in each quarter word down into the lower 32 bits of the word (with the top 4 bits zero) so you could do arithmetic on it.
Likewise, htonl() would take the lower 32 bits in a word and undo this operation to put each 8 bit quantity into the lower 8 bits of each 9 bit quarter word.
So to answer the original question, the ntohl() and htonl() operations on this computer architecture were very different from each other.
For example:
COMP* . COMPRESS A WORD
LSSL A0,36 . CLEAR OUT A0
LSSL A1,1 . THROW AWAY TOP BIT
LDSL A0,8 . GET 8 GOOD ONE'S
LSSL A1,1 .
LDSL A0,8 .
LSSL A1,1 .
LDSL A0,8 .
LSSL A1,1 .
LDSL A0,8 .
J 0,X9 .
.
DCOMP* . DECOMPRESS A WORD
LSSL A0,36 . CLEAR A0
LSSL A1,4 . THROW OUT NOISE
LDSL A0,8 . MOVE 8 GOOD BITS
LSSL A0,1 . ADD 1 NOISE BIT
LDSL A0,8 . MOVE 8 GOOD BITS
LSSL A0,1 . ADD 1 NOISE BIT
LDSL A0,8 . MOVE 8 GOOD BITS
LSSL A0,1 . ADD 1 NOISE BIT
LDSL A0,8 . MOVE 8 GOOD BITS
J 0,X9 .
COMP is the equivalent of ntohl() and DCOMP of htonl(). For those not familiar with UNIVAC 1100 assembly code :-) LSSL is "Left Single Shift Logical", which shifts a register left by a number of positions. LDSL is "Left Double Shift Logical", which shifts a pair of registers left by the specified count. So LDSL A0,8 shifts the concatenated A0, A1 registers left 8 bits, shifting the high 8 bits of A1 into the lower 8 bits of A0.
This code was written in 1981 for a UNIVAC 1108. Some years later, when we had an 1100/90 and it grew a C compiler, I started a port of the BSD NET/2 TCP/IP implementation and implemented ntohl() and htonl() in a similar way. Sadly, I never completed that work.
If you wonder why some of the Internet RFCs use the term "octet", it's because some computers of the day (like PDP-10s, Univacs, etc.) had "bytes" that were not 8 bits. An "octet" was defined specifically to be an 8-bit byte.
I couldn't find the original draft of the POSIX spec, but a recent one found online has a hint.
Network byte order may not be convenient for processing actual values. For this, it is more sensible for values to be stored as ordinary integers. This is known as "host byte order". In host byte order:
The most significant bit might not be stored in the first byte in address order.
**Bits might not be allocated to bytes in any obvious order at all.**
8-bit values stored in uint8_t objects do not require conversion to or from host byte order, as they have the same representation. 16- and 32-bit values can be converted using the htonl(), htons(), ntohl(), and ntohs() functions.
Interestingly, though, the following statement is also made in that discussion:
The POSIX standard explicitly requires 8-bit char and two’s-complement arithmetic.
So that basically rules out my idea of a 1's complement machine implementation.
But the "any obvious order at all" statement basically suggests that the posix committee at least considered the possibility of posix/unix running on something other than big or little endian. As such declaring htonl and ntohl as differnet implementations can't be ruled out.
So the short answer is "htonl and ntohl are the same implementation, but the interface of two different functions is for future compatibility with the unknown."
Not all machines will have the same endianness, and these methods take care of that. It is given that 'network order' is big endian. If you have a machine that is running a big endian architecture and you run ntohl, the output will be the same as the input (because the endianness is the same as network). If your machine is a little endian architecture, ntohl will convert the data from big to little endian. The same can be said about htonl (converts host data to network byte order when necessary). To answer your question, these two operations are not equivalent when you're transmitting data between two machines with different endianness.
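On today's common big- and little-endian machines, both functions are either a plain byte swap or the identity, and each of those is its own inverse, which a quick Python check (Python's socket module wraps the same primitives) makes visible:

import socket, sys

x = 0x01020304
print(sys.byteorder)                        # 'little' on x86/ARM by default
print(hex(socket.htonl(x)))                 # 0x4030201 on little-endian, 0x1020304 on big-endian
print(socket.htonl(x) == socket.ntohl(x))   # True on either kind of machine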

Is there any classic 3 byte fingerprint function?

I need a checksum/fingerprint function for short strings (say, 16 to 256 bytes) which fits in a 24 bits word. Is there any well known algorithm for that?
I propose to use a 24-bit CRC as an easy solution. CRCs are available in all lengths and always simple to compute. Wikipedia has a matching entry. The quality is far better than a modulo-reduced sum, because swapping characters will most likely produce a different CRC.
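As a concrete illustration, a bit-at-a-time Python sketch using the CRC-24 parameters from RFC 4880 (OpenPGP), polynomial 0x1864CFB and initial value 0xB704CE; any 24-bit CRC from the Wikipedia table would be used the same way:

CRC24_INIT = 0xB704CE
CRC24_POLY = 0x1864CFB

def crc24(data):
    # Plain bitwise CRC; fast enough for 16-256 byte strings.
    crc = CRC24_INIT
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF

print("%06x" % crc24(b"hello"))   # 24 bits -> six hex digits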
The next step (if a wrong string with the same checksum is a real threat) would be a cryptographic MAC like CMAC. While its output is too long out of the box, it can be truncated to the first 24 bits.
The simplest thing to do is a basic checksum: add up the bytes in the string, mod 2^24.
You have to watch out for character-set issues when converting to bytes, though, so everyone agrees on the same encoding of characters to bytes.
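For comparison, the additive checksum is a one-liner in Python; the explicit UTF-8 encode is the "everyone agrees on the encoding" part:

def checksum24(s):
    # Sum of the bytes, reduced to 24 bits. Weak: swapping two
    # characters gives the same sum, unlike the CRC above.
    return sum(s.encode("utf-8")) % (1 << 24)

print("%06x" % checksum24("hello"))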

MD5 hash or CRC32: which one to use in this case?

I need a hash that can be represented in fewer than 26 chars.
MD5 produces a 32-char-long string; if I convert it to base 36, how good will it be?
I need the hash not for cryptography but rather for uniqueness, basically identifying each input depending on the time of input and the input data. Currently I can think of this:
$hash=md5( str_ireplace(".","",microtime()).md5($input_data) ) ;
$unique_id= base_convert($hash,16,36) ;
Should I go like this, or use CRC32, which will give a smaller hash size, but I'm afraid it won't be that unique?
I think a much simpler solution is possible.
According to your statement, you have 26 characters of space. However, to clarify what you and I each understand a "character" to be, let's do some digging.
The MD5 hash, according to Wikipedia, produces 16-byte hashes.
The CRC32 algorithm produces 4-byte hashes.
I understand "characters" (in the simplest sense) to be ASCII characters. Each ASCII character (e.g. A = 65) is 8 bits long.
The MD5 algorithm produces a hash of 16 bytes × 8 bits per byte = 128 bits; CRC32 is 32 bits.
You must understand that hashes are not mathematically unique, but "likely to be unique."
So my solution, given your description, would be to represent the bits of the hash as ASCII characters.
If you only have the choice between MD5 and CRC32, the answer would be MD5. But you could also fit a SHA-1 160-bit hash into a string of fewer than 26 characters (it would be 20 ASCII characters long).
If you are concerned about the set of symbols that each hash uses, both hashes are in the set [A-Za-z0-9] (I believe).
Finally, when you convert what are essentially numbers from one base to another, the number doesn't change, therefore the strength of the algorithm doesn't change; it just changes the way the number is represented.
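To check the base-36 idea concretely, a hedged Python sketch with a hand-rolled converter (the point is that the conversion must be arbitrary-precision):

import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = ALPHABET[r] + out
    return out or "0"

digest = hashlib.md5(b"some input").hexdigest()   # 32 hex chars = 128 bits
b36 = to_base36(int(digest, 16))
print(b36, len(b36))   # at most 25 chars, since 36**25 > 2**128

So a full 128-bit MD5 value always fits in 25 base-36 characters, under the asker's 26-character limit. One caveat about the PHP snippet in the question: base_convert() works in floating point internally and loses precision on numbers this large, so an arbitrary-precision conversion is needed there too.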