I was looking for a Lua implementation of CRC32 and stumbled upon this:
https://github.com/openresty/lua-nginx-module/blob/master/t/lib/CRC32.lua
However, it returns an integer hash; how would I go about getting the string equivalent of it?
Using the input "something" it returns: 1850105976
Using an online CRC32 generator I get: "879fb991"
There are many CRC-32 algorithms. You can find ten different CRC-32s documented in this catalog. The Lua code you found and the online CRC32 you found (somewhere -- no link was provided) are different CRC-32s.
What you seem to mean by a "string equivalent" is the hexadecimal representation of the 32-bit integer. In Lua you can use string.format with the print format %x to get hexadecimal. For the example you gave, 1850105976, that would be 6e466078.
Your "online CRC32 generator" appears to be using the BZIP2 CRC-32, though it is showing you the bytes of the resulting CRC in reversed order (little-endian). So the actual CRC in that case in hexadecimal is 91b99f87. The Lua code you found appears to be using the MPEG-2 CRC-32. The only difference between those is the exclusive-or with ffffffff. So in fact the exclusive-or of the two CRCs you got from the two different sources, 6e466078 ^ 91b99f87 is ffffffff.
Related
Being a pentester, I have encountered a hash divided into two parts (the first one probably being the salt), seemingly encoded in Base64, but I am unable to find out the encryption type.
The input that gave me this hash is the string "password". Is anybody able to give me a hint?
67Wm8zeMSS0=
s9bD0QOa7A6THDMLa39+3LmXgcxzUFdmszeZdlTUzjY=
Thanks in advance
Maybe it's hashed with SHA-256 (or some other 256-bit hash algorithm), because if you Base64 decode it and hex encode you get:
ebb5a6f3378c492d
b3d6c3d1039aec0e931c330b6b7f7edcb99781cc73505766b337997654d4ce36
The first has a length of 16 hex characters (8 bytes) and the second a length of 64 (32 bytes). That's probably not a coincidence.
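For reference, the decoding step above can be reproduced in Python (the two strings are the ones from the question):

import base64
print(base64.b64decode("67Wm8zeMSS0=").hex())
print(base64.b64decode("s9bD0QOa7A6THDMLa39+3LmXgcxzUFdmszeZdlTUzjY=").hex())

This prints the 8-byte and 32-byte hex strings shown above.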
Edit: Maybe it's hashed multiple times; an iterated hash. As this post says it is better to decompile the software.
I have seen two definitions for big-endian/little-endian which cause my confusion.
The first definition is the classic one, related to the machine:
Big-endian systems store the most significant byte of a word in the smallest address and the least significant byte is stored in the largest address (also see Most significant bit). Little-endian systems, in contrast, store the least significant byte in the smallest address.
This makes perfect sense, and it was my definition of big/little endian my whole life, until I came across various discussions related to cryptography:
The book "Cryptography for Developers" by Tom St Denis says, "the OS2IP function converts the octet string to integer by loading the octet strings in big endian fashion. That is, the first byte is the most significant."
https://crypto.stackexchange.com/questions/10824/what-does-an-rsa-signature-look-like/10826#10826
In the accepted answer of this question, it says, "The padded value is then interpreted as an integer x, by decoding it with the big-endian convention."
Apparently, these two crypto discussions do not involve anything related to the machine architecture. What is their definition of big-endian fashion/convention?
Big and little endian are just conventions for representing numbers as bytes. In big endian, the most significant byte comes first; in little endian it's the other way around. Different architectures, data formats, algorithms and networking protocols may adopt different strategies.
Moreover, good programs will not depend on the endianness of the architecture. For example, to read a 16-bit big-endian number from a byte array you could write something like:
int read_big_endian_16(unsigned char *data) {
    /* the first byte holds the most significant 8 bits */
    return (data[0] << 8) + data[1];
}
or using functions like ntohs() and friends.
In Python it's:
struct.unpack('>h', data)[0]   # '>' selects big-endian; unpack returns a tuple
Binary data formats are a good example of when endianness is important, if you expect them to be cross-platform: if you write data on a little-endian platform, you want to be able to read it on a big-endian one. That's why any decent format specifies those things explicitly, and portable programs take into account the chance of being compiled/run on different architectures. Another example is multibyte character encodings like UTF-16LE and UTF-16BE.
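As a rough illustration of the two conventions in Python (matching the struct example above; the value 0x1234 is an arbitrary example):

import struct
value = 0x1234
print(struct.pack('>H', value).hex())   # '1234' - most significant byte first (big endian)
print(struct.pack('<H', value).hex())   # '3412' - least significant byte first (little endian)
# Reading little-endian bytes with a big-endian convention yields a different number:
print(hex(struct.unpack('>H', struct.pack('<H', value))[0]))   # 0x3412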
You can find a more detailed explanation here
In http://nedbatchelder.com/text/unipain.html it is explained that:
In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points.
What's the difference between a code point and a byte? (I'm not really thinking in terms of Python per se, but about the concept in general.) Essentially it's just a bunch of bits, right? I think of a plain old string literal as treating every 8 bits as a byte and handling it as such, and we interpret the bytes as integers, which lets us map them to ASCII and the extended character sets. What's the difference between interpreting an integer as one of those characters and interpreting a "code point" as a Unicode character? It says Python's Unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly for the interpretation (where the bits of each Unicode character start and stop, as in UTF-8, for example)?
A code point is a number which acts as an identifier for a Unicode character. A code point itself cannot be stored; it must be encoded from Unicode into bytes in, e.g., UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the code point to the bytes.
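A small Python illustration of the distinction (the character '€' is an arbitrary example): one code point, but different byte sequences depending on the encoding chosen.

ch = '€'
print(ord(ch))                        # 8364, i.e. code point U+20AC
print(ch.encode('utf-8').hex())       # 'e282ac' - three bytes in UTF-8
print(ch.encode('utf-16-le').hex())   # 'ac20'   - two bytes in UTF-16LE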
A computer system is based on the binary system. Data/instructions are encoded in binary. Encoding can be carried out in many formats: ASCII, Unicode, etc.
Is a microprocessor made for a chosen 'encoding format'? If yes, how would it become compatible with other encoding formats? Wouldn't there be a performance penalty in that case?
When we create a program, how is its encoding format chosen?
ASCII and Unicode are encodings of text data; they say nothing about how the processor handles binary data in general.
No, all microprocessors know about is binary numbers - they don't have a clue about the meaning of those numbers. That meaning is provided by us and by our tools used to build programs. For example, if you compile a C++ program using Visual Studio, it will use multi-byte characters, but the CPU doesn't know that.
One area where the microprocessor architecture does matter is endianness: for example, when you try to read a UTF-16LE encoded file on a big-endian machine, you have to swap the individual bytes of each code unit to get the expected 16-bit integer. This is an issue for all encoding forms whose code unit is wider than one byte. See section 2.6 of the Unicode standard for a more in-depth discussion. The processor itself still works with individual integer numbers, but as a library developer, you have to deal with the mapping from files (i.e., byte sequences) to memory arrays (i.e., code unit sequences).
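A sketch of that byte-order issue in Python (the string 'Hi' is an arbitrary example): the same bytes decode to different code units depending on which byte order you assume.

data = 'Hi'.encode('utf-16-le')   # b'H\x00i\x00'
print(data.hex())                 # '48006900'
print(data.decode('utf-16-le'))   # 'Hi'
print(data.decode('utf-16-be'))   # two CJK characters: same bytes, byte-swapped code units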
What's the most efficient way to convert an md5 hash to a unique integer to perform a modulus operation?
Since the solution language was not specified, Python is used for this example.
import os
import hashlib

array = os.urandom(1 << 20)    # 1 MiB of random bytes as sample input
md5 = hashlib.md5()
md5.update(array)
digest = md5.hexdigest()       # 32-character hex string
number = int(digest, 16)       # interpret the hex digest as one 128-bit integer
print(number % YOUR_NUMBER)    # YOUR_NUMBER is the modulus you want
You haven't said what platform you're running on, or what the format of this hash is. Presumably it's hex, so you've got 16 bytes of information.
In order to convert that to a unique integer, you basically need a 16-byte (128-bit) integer type. Many platforms don't have such a type available natively, but you could use two long values in C# or Java, or a BigInteger in Java or .NET 4.0.
Conceptually you need to parse the hex string to bytes, and then convert the bytes into an integer (or two). The most efficient way of doing that will entirely depend on which platform you're using.
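For instance, in Python (which another answer here already uses), the parse-hex-then-convert route could look like this; the input b"example" and the modulus 97 are arbitrary:

import hashlib
digest = hashlib.md5(b"example").hexdigest()   # 32 hex characters
raw = bytes.fromhex(digest)                    # the 16 raw bytes of the hash
number = int.from_bytes(raw, 'big')            # one 128-bit integer
print(number % 97)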
There is more data in an MD5 than will fit in even a 64-bit integer, so there's no way (without knowing what platform you are using) to get a unique integer. You can get a somewhat unique one by converting the hex version to several integers' worth of data and then combining them (by addition or multiplication). How exactly you would go about that depends on what language you are using, though.
A lot of languages implement either an unpack or an sscanf function, which are good places to start looking.
If all you need is the modulus, you don't actually need to convert it to a 128-bit integer. You can go digit by digit or byte by byte, like this:
unsigned int mod = 0;
for (int i = 0; i < 32; i++)
{
    /* convert the hex character to its 0-15 value (hexdigest output is lowercase) */
    int digit = (md5[i] <= '9') ? md5[i] - '0' : md5[i] - 'a' + 10;
    mod = (mod * 16 + digit) % divider;
}
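The same digit-by-digit reduction, sketched in Python for comparison (the input b"example" and the divider of 97 are arbitrary):

import hashlib
digest = hashlib.md5(b"example").hexdigest()
divider = 97
mod = 0
for ch in digest:                    # one hex character at a time
    mod = (mod * 16 + int(ch, 16)) % divider
print(mod)                           # equals int(digest, 16) % divider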
You'll need to define your own hash function that converts an MD5 string into an integer of the desired width. If you want to interpret the MD5 hash as a plain string, you can try the FNV algorithm. It's pretty quick and fairly evenly distributed.
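A minimal 32-bit FNV-1a sketch in Python, treating the hex digest as a plain string as suggested; the constants are the standard 32-bit FNV offset basis and prime, while the input b"example" and the modulus 97 are arbitrary:

import hashlib

def fnv1a_32(text):
    h = 0x811C9DC5                           # 32-bit FNV-1a offset basis
    for byte in text.encode('ascii'):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # 32-bit FNV prime, kept to 32 bits
    return h

digest = hashlib.md5(b"example").hexdigest()
print(fnv1a_32(digest) % 97)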