Different results from Murmur3 from Scala and Guava - scala

I am trying to generate hashes using the Murmur3 algorithm. The hashes are consistent, but Scala and Guava return different values for the same input.
import scala.util.hashing.MurmurHash3
import com.google.common.hash.Hashing
import org.scalatest.FunSuite

class package$Test extends FunSuite {
  test("Generate hashes") {
    println(s"Seed = ${MurmurHash3.stringSeed}")
    val vs = Set("abc", "test", "bucket", 111.toString)
    vs.foreach { x =>
      println(s"[SCALA] Hash for $x = ${MurmurHash3.stringHash(x).abs % 1000}")
      println(s"[GUAVA] Hash for $x = ${Hashing.murmur3_32().hashString(x).asInt().abs % 1000}")
      println(s"[GUAVA with seed] Hash for $x = ${Hashing.murmur3_32(MurmurHash3.stringSeed).hashString(x).asInt().abs % 1000}")
      println()
    }
  }
}
Seed = -137723950
[SCALA] Hash for abc = 174
[GUAVA] Hash for abc = 419
[GUAVA with seed] Hash for abc = 195
[SCALA] Hash for test = 588
[GUAVA] Hash for test = 292
[GUAVA with seed] Hash for test = 714
[SCALA] Hash for bucket = 413
[GUAVA] Hash for bucket = 22
[GUAVA with seed] Hash for bucket = 414
[SCALA] Hash for 111 = 250
[GUAVA] Hash for 111 = 317
[GUAVA with seed] Hash for 111 = 958
Why am I getting different hashes?

It looks to me like Scala's stringHash converts pairs of UTF-16 chars to ints differently from Guava's hashUnencodedChars (hashString with no Charset was renamed to that).
Scala:
val data = (str.charAt(i) << 16) + str.charAt(i + 1)
Guava:
int k1 = input.charAt(i - 1) | (input.charAt(i) << 16);
In Guava, the char at an index i becomes the 16 least significant bits of the int and the char at i + 1 becomes the most significant 16 bits. In the Scala implementation, that's reversed: the char at i is the most significant while the char at i + 1 is the least significant. (The fact that the Scala implementation uses + rather than | could also be significant, I imagine.)
Note that the Guava implementation is equivalent to using ByteBuffer.putChar(c) twice to put two characters into a little endian ByteBuffer, then using ByteBuffer.getInt() to get an int value back out. The Guava implementation is also equivalent to encoding the characters to bytes using UTF-16LE and hashing those bytes. The Scala implementation is not equivalent to encoding the string in any of the standard charsets that JVMs are required to support. In general, I'm not sure what precedent (if any) Scala has for doing it the way it does.
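To make the difference concrete, here is a tiny Scala sketch (mine, not code from either library) of just the packing step, showing how the same two chars end up as different ints under the two schemes:

val c0 = 'a'.toInt               // 0x61
val c1 = 'b'.toInt               // 0x62
val scalaStyle = (c0 << 16) + c1 // 0x00610062: 'a' lands in the high 16 bits
val guavaStyle = c0 | (c1 << 16) // 0x00620061: 'a' lands in the low 16 bits
println(f"scalaStyle = 0x$scalaStyle%08x, guavaStyle = 0x$guavaStyle%08x")

Since every multi-character string goes through this packing step, the two hash streams diverge immediately.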
Edit:
The Scala implementation also differs from the Guava implementation in another way: it passes the number of chars being hashed to the finalizeHash method, whereas Guava's implementation passes the number of bytes to the equivalent fmix method.

I believe hashString(x, StandardCharsets.UTF_16BE) should match Scala's behavior. Let us know.
(Also, please upgrade your Guava to something newer!)
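For reference, a minimal sketch of that suggested comparison (assuming a newer Guava where hashString requires an explicit Charset); given the finalizeHash/fmix length difference noted in the edit above, the two values may still disagree, which is exactly what this lets you check:

import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing
import scala.util.hashing.MurmurHash3

val x = "abc"
val scalaHash = MurmurHash3.stringHash(x)
// Seed Guava with Scala's stringSeed and feed it the UTF-16BE bytes of the string
val guavaHash = Hashing.murmur3_32(MurmurHash3.stringSeed)
  .hashString(x, StandardCharsets.UTF_16BE)
  .asInt()
println(s"Scala: $scalaHash, Guava over UTF-16BE bytes: $guavaHash")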

Related

Find a string for which hash() starts with 0000

I've got a task from my professor and unfortunately I'm really confused.
The task:
Find a string D1 for which hash(D1) contains 4 first bytes equal 0.
So it should look like "0000....."
As I know we cannot just decrypt a hash, and checking them one by one is kind of pointless work.
I've got a task from my professor...
Find a string D1 for which hash(D1) contains 4 first bytes equal 0. So it should look like "0000....."
As I know we cannot just decrypt a hash, and checking them one by one is kind of pointless work.
In this case it seems like the work is not really "pointless." Rather, you are doing this work because your professor asked you to do it.
Some commenters have mentioned that you could look at the Bitcoin blockchain as a source of hashes, but this will only work if your hash of interest is the same one used by Bitcoin (double SHA-256!).
The easiest way to figure this out in general is just to brute force it:
Pseudo-code a la python
for x in range(10*2**32):  # Any number bigger than about 4 billion should work
    x_str = str(x)  # Any old method to generate some bytes to hash should work
    x_bytes = x_str.encode('utf-8')
    hash_bytes = hash(x_bytes)  # assuming hash() returns bytes
    if hash_bytes[0:4] == b'\x00\x00\x00\x00':
        print("Found string: {}".format(x_str))
        break
I wrote a short python3 script, which repeatedly tries hashing random values until it finds a value whose SHA256 hash has four leading zero bytes:
import secrets
import hashlib
while(True):
    p=secrets.token_bytes(64)
    h=hashlib.sha256(p).hexdigest()
    if(h[0:8]=='00000000'): break

print('SHA256(' + p.hex() + ')=' + h)
After running for a few minutes (on my ancient Dell laptop), it found a value whose SHA256 hash has four leading zero bytes:
SHA256(21368dc16afcb779fdd9afd57168b660b4ed786872ad55cb8355bdeb4ae3b8c9891606dc35d9f17c44219d8ea778d1ee3590b3eb3938a774b2cadc558bdfc8d4)=000000007b3038e968377f887a043c7dc216961c22f8776bbf66599acd78abf6
The following command-line command verifies this result:
echo -n '21368dc16afcb779fdd9afd57168b660b4ed786872ad55cb8355bdeb4ae3b8c9891606dc35d9f17c44219d8ea778d1ee3590b3eb3938a774b2cadc558bdfc8d4' | xxd -r -p | sha256sum
As expected, this produces:
000000007b3038e968377f887a043c7dc216961c22f8776bbf66599acd78abf6
Edit 5/8/21
Optimized version of the script, based on my conversation with kelalaka in the comments.
import secrets
import hashlib
N=0
p=secrets.token_bytes(32)
while(True):
    h=hashlib.sha256(p).digest()
    N+=1
    if(h.hex()[0:8]=='0'*8): break
    p=h
print('SHA256(' + p.hex() + ')=' + h.hex())
print('N=' + str(N))
Instead of generating a new random number in each iteration of the loop to use as the input to the hash function, this version of the script uses the output of the hash function from the previous iteration as the input to the hash function in the current iteration. On my system, this quadruples the number of iterations per second. It found a match in 1483279719 iterations in a little over 20 minutes:
$ time python3 findhash2.py
SHA256(69def040a417caa422dff20e544e0664cb501d48d50b32e189fba5c8fc2998e1)=00000000d0d49aaaf9f1e5865c8afc40aab36354bc51764ee2f3ba656bd7c187
N=1483279719
real 20m47.445s
user 20m46.126s
sys 0m0.088s
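Not part of the original answer, but for anyone on the JVM, here is a minimal Scala sketch of the same chained search using java.security.MessageDigest (variable names mirror the Python above); expect it to run for a long time, since roughly 2^32 hashes are needed on average:

import java.security.{MessageDigest, SecureRandom}

val md = MessageDigest.getInstance("SHA-256")
var p = new Array[Byte](32)
new SecureRandom().nextBytes(p)  // random starting preimage
var h = md.digest(p)
var n = 1L
while (!(h(0) == 0 && h(1) == 0 && h(2) == 0 && h(3) == 0)) {
  p = h           // feed the previous digest back in as the next preimage
  h = md.digest(p)
  n += 1
}
println(s"N=$n")
println("SHA256(" + p.map("%02x".format(_)).mkString + ")=" + h.map("%02x".format(_)).mkString)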
The sha256 hash of the string $Eo is 0000958bc4dc132ad12abd158073204d838c02b3d580a9947679a6
This was found using the code below, which restricts the string to printable ASCII keyboard characters. It cycles through the hashes of each 1-character string (technically it hashes bytes, not strings), then each 2-character string, then each 3-character string, then each 4-character string (it never had to go to 4 characters, so I'm not 100% sure the math for that part of the function is correct).
The 'limit' value is included to prevent the code from running forever in case a match is not found. This ended up not being necessary, as a match was found in 29970 iterations and the execution time was nearly instantaneous.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from hashlib import sha256
utf8_chars = list(range(0x21,0x7f))
def make_str(attempt):
    if attempt < 94:
        c0 = [attempt%94]
    elif attempt >= 94 and attempt < 8836:
        c2 = attempt//94
        c1 = attempt%94
        c0 = [c2,c1]
    elif attempt >= 8836 and attempt < 830584:
        c3 = attempt//8836
        c2 = (attempt-8836*c3)//94
        c1 = attempt%94
        c0 = [c3,c2,c1]
    elif attempt >= 830584 and attempt < 78074896:
        c4 = attempt//830584
        c3 = (attempt-830584*c4)//8836
        c2 = ((attempt-830584*c4)-8836*c3)//94
        c1 = attempt%94
        c0 = [c4,c3,c2,c1]
    return bytes([utf8_chars[i] for i in c0])

target = '0000'
limit = 1200000
attempt = 0
hash_value = sha256()
hash_value.update(make_str(attempt))
while hash_value.hexdigest()[0:4] != target and attempt <= limit:
    hash_value = sha256()
    attempt += 1
    hash_value.update(make_str(attempt))

t = ''.join([chr(i) for i in make_str(attempt)])
print([t, attempt])

Scala Convert Between Unsigned and Signed Bytes

I have an unsigned data representation like this:
val str = "145 38 0 255 0 1"
I now want to get this unsigned representation as a byte string, after which I can manipulate the individual bits to extract information. So basically what I want is the unsigned version.
For example, the unsigned binary representation of 145 is 10010001, but the signed interpretation of that same bit pattern is -111.
scala> 145.byteValue
res128: Byte = -111 // I would need an unsigned value instead!
So is there a function or an approach to convert the 145 to an unsigned representation?
Signed and unsigned bytes (or Ints or Longs) are just two different ways to interpret the same binary bits.
In signed bytes the first bit from the left (most significant bit) is interpreted as the sign.
0 means positive, 1 means negative.
For negative numbers, the magnitude printed after the minus sign is 256 minus the unsigned value.
So in your case what we get is 256 - 145 = 111.
Java / Scala bitwise operators work on the underlying bits, not on the signed/unsigned representation, but of course the results, when printed, are still interpreted as signed.
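If all you actually need is the unsigned numeric value back as an Int, masking with 0xFF (or java.lang.Byte.toUnsignedInt on Java 8+) does it, along these lines:

scala> 145.byteValue & 0xFF
res0: Int = 145

scala> java.lang.Byte.toUnsignedInt(145.byteValue)
res1: Int = 145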
Actually, to save on confusion, I would just work with Ints (or Shorts).
But it will work just as well with bytes.
For example to get the n-th bit you could do something like:
def bitValue(byte: Byte, n: Byte) = {
  val pow2 = Math.pow(2, n - 1).toInt
  if ((pow2 & byte) == 0) 0
  else 1
}
bitValue(145.byteValue, 8)
res27: Int = 1
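An equivalent shift-based version (a sketch, not from the original answer) avoids the floating-point Math.pow round trip:

def bitValue2(byte: Byte, n: Int): Int = (byte >> (n - 1)) & 1

bitValue2(145.byteValue, 8)
res28: Int = 1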

Convert a string into hash value in matlab

How can I convert a message into a hash value using SHA/MD5 hashing in MATLAB? Is there any built-in function or any fixed code?
There are no built-in functions in MATLAB to calculate hashes. However, you can call Java (any OS) or .NET (Windows only) functions directly from MATLAB, and either of these can do what you want.
Note that you haven't specified the encoding of the string. The hash is different if you consider the string in ASCII, UTF8, UTF16, etc.
Also note that MATLAB does not have 160-bit or 256-bit integers, so the hash obviously can't be a single integer.
Anyway, using .Net:
SHA256
string = 'some string';
sha256hasher = System.Security.Cryptography.SHA256Managed;
sha256 = uint8(sha256hasher.ComputeHash(uint8(string)));
dec2hex(sha256)
SHA1
sha1hasher = System.Security.Cryptography.SHA1Managed;
sha1= uint8(sha1hasher.ComputeHash(uint8(string)));
dec2hex(sha1)
A Java-based solution can be found at the following link:
https://www.mathworks.com/matlabcentral/answers/45323-how-to-calculate-hash-sum-of-a-string-using-java
MATLAB's .NET classes appear to be a more recent addition than the Java hashing.
However, these classes don't have much/any public documentation available. After playing with it a bit, I found a way to specify one of several hash algorithms, as desired.
The "System.Security.Cryptography.HashAlgorithm" constructor accepts a hash algorithm name (string). Based on the string name you pass in, it returns different hasher classes (.SHA256Managed is only one type). See the example below for a complete string input ==> hash string output generation.
% Available options are 'SHA1', 'SHA256', 'SHA384', 'SHA512', 'MD5'
algorithm = 'SHA1';
% SHA1 category
hasher = System.Security.Cryptography.HashAlgorithm.Create('SHA1'); % DEFAULT
% SHA2 category
hasher = System.Security.Cryptography.HashAlgorithm.Create('SHA256');
hasher = System.Security.Cryptography.HashAlgorithm.Create('SHA384');
hasher = System.Security.Cryptography.HashAlgorithm.Create('SHA512');
% SHA3 category: Does not appear to be supported
% MD5 category
hasher = System.Security.Cryptography.HashAlgorithm.Create('MD5');
% GENERATING THE HASH:
str = 'Now is the time for all good men to come to the aid of their country';
hash_byte = hasher.ComputeHash( uint8(str) ); % System.Byte class
hash_uint8 = uint8( hash_byte ); % Array of uint8
hash_hex = dec2hex(hash_uint8); % Array of 2-char hex codes
% Generate the hex codes as 1 long series of characters
hashStr = str([]);
nBytes = length(hash_hex);
for k=1:nBytes
    hashStr(end+1:end+2) = hash_hex(k,:);
end
fprintf(1, '\n\tThe %s hash is: "%s" [%d bytes]\n\n', algorithm, hashStr, nBytes);
% SIZE OF THE DIFFERENT HASHES:
% SHA1: 20 bytes = 20 hex codes = 40 char hash string
% SHA256: 32 bytes = 32 hex codes = 64 char hash string
% SHA384: 48 bytes = 48 hex codes = 96 char hash string
% SHA512: 64 bytes = 64 hex codes = 128 char hash string
% MD5: 16 bytes = 16 hex codes = 32 char hash string
REFERENCES:
1) https://en.wikipedia.org/wiki/SHA-1
2) https://defuse.ca/checksums.htm#checksums
I just used this and it works well. It works on strings, files, and different data types.
For a file, I compared against CRC SHA through File Explorer and got the same answer.
https://www.mathworks.com/matlabcentral/fileexchange/31272-datahash

Perl - Forming a value from bits lying between two (16 bit) fields (data across word boundaries)

I am reading some data into 16 bit data words and extracting values from parts of those words. Some of the values I need straddle the word boundaries.
I need to take the bits from the first word and some from the second word and join them to form a value.
I am thinking of the best way to do this. I could bit shift stuff all over the place and compose the data that way, but I suspect there must be an easier/better way, because I have many cases like this and the values are in some cases different sizes (which I know, since I have a data map).
For instance:
[TTTTTDDDDPPPPYYY] - 16 bit field
[YYYYYWWWWWQQQQQQ] - 16 bit field
TTTTT = 5 bit value, easily extracted
DDDD = 4 bit value, easily extracted
WWWWW = 5 bit value, easily extracted
QQQQQQ = 6 bit value, easily extracted
YYYYYYYY = 8 bit value, which straddles the word boundaries. What is the best way to extract this? In my case I have a LOT of data like this, so elegance/simplicity in a solution is what I seek.
Aside: in Perl, what are the limits of left shifting? I am on a 32 bit computer; am I right to guess that my (duck) types are 32 bit variables and that I can shift that far, even though I unpacked the data as 16 bits (unpack with type n) into a variable? This situation came up when trying to extract a 31 bit value that spans two 16 bit fields.
Lastly (someone may ask), reading/unpacking the data into 32 bit words does not help me, as I still face the same issue: the data is not aligned on word boundaries but crosses them.
The size of your integers is given (in bytes) by perl -V:ivsize, or programmatically using use Config qw( %Config ); $Config{ivsize}. They'll be 32 bits in a 32-bit build (since they are guaranteed to be large enough to hold a pointer). That means you can use
my $i = ($hi << 16 | $lo); # TTTTTDDDDPPPPYYYYYYYYWWWWWQQQQQQ
my $q = ($i >> 0) & (2**6-1);
my $w = ($i >> 6) & (2**5-1);
my $y = ($i >> 11) & (2**8-1);
my $p = ($i >> 19) & (2**4-1);
my $d = ($i >> 23) & (2**4-1);
my $t = ($i >> 27) & (2**5-1);
If you wanted to stick to 16 bits, you could use the following:
my $y = ($hi & 0x7) << 5 | ($lo >> 11);
($hi & 0x7) << 5 :  00000000YYY00000
$lo >> 11        :  00000000000YYYYY
                    ----------------
OR'ed together   :  00000000YYYYYYYY

How does Perl store integers in-memory?

say pack "A*", "asdf"; # Prints "asdf"
say pack "s", 0x41 * 256 + 0x42; # Prints "BA" (0x41 = 'A', 0x42 = 'B')
The first line makes sense: you're taking an ASCII encoded string, packing it into a string as an ASCII string. In the second line, the packed form is "\x42\x41" because of the little endian-ness of short integers on my machine.
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number, since that's how (I assume) Perl stores numbers: as a little-endian sequence of bytes. Is there a way to do so without unpacking it? I'm trying to get the correct mental model for the thing that pack() returns.
For instance, in C, I can do this:
#include <stdio.h>
int main(void) {
    char c[2];
    short *x = (short *) c;
    c[0] = 0x42;
    c[1] = 0x41;
    printf("%d\n", *x); // Prints 16706 == 0x41 * 256 + 0x42
    return 0;
}
If you're really interested in how Perl stores data internally, I'd recommend PerlGuts Illustrated. But usually, you don't have to care about stuff like that because Perl doesn't give you access to such low-level details. These internals are only important if you're writing XS extensions in C.
If you want to "cast" a two-byte string to a C short, you can use the unpack function like this:
$ perl -le 'print unpack("s", "BA")'
16706
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number,
You need to unpack it first.
To be able to use it as a number in C, you need
char* packed = "\x42\x41";
int16_t int16;
memcpy(&int16, packed, sizeof(int16_t));
To be able to use it as a number in Perl, you need
my $packed = "\x42\x41";
my $num = unpack('s', $packed);
which is basically
use Inline C => <<'__EOI__';
SV* unpack_s(SV* sv) {
    STRLEN len;
    char* buf;
    int16_t int16;

    SvGETMAGIC(sv);
    buf = SvPVbyte(sv, len);
    if (len != sizeof(int16_t))
        croak("usage");

    Copy(buf, &int16, 1, int16_t);
    return newSViv(int16);
}
__EOI__
my $packed = "\x42\x41";
my $num = unpack_s($packed);
since that's how (I assume) perl stores numbers, as little-endian sequence of bytes.
Perl stores numbers in one of the following three fields of a scalar:
IV, a signed integer of size perl -V:ivsize (in bytes).
UV, an unsigned integer of size perl -V:uvsize (in bytes). (ivsize=uvsize)
NV, a floating point number of size perl -V:nvsize (in bytes).
In all cases, native endianness is used.
I'm trying to get the correct mental model for the thing that pack() returns.
pack is used to construct "binary data" for interfacing with external APIs.
I see pack as a serialization function. It takes Perl values as input and outputs a serialized form. The fact that the output serialized form happens to be a Perl bytestring is more of an implementation detail than a core feature.
As such, all you're really expected to do with the resulting string is feed it to unpack, though the serialized form is convenient for moving data between processes, hosts, planets.
If you're interested in serializing it to a number instead, consider using vec:
say vec "BA", 0, 16; # prints 16961
To take a closer look at the string's internal representation, use Devel::Peek, though you're not going to see anything surprising with a pure ASCII string.
use Devel::Peek;
Dump "BA";
SV = PV(0xb42f80) at 0xb56300
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0xb60cc0 "BA"\0
CUR = 2
LEN = 16