Does 'mixing' the result of a 32-bit hash to create a 64-bit hash have any value? - hash

For example, if you're programming in Java, and you want to create a 64-bit hash function for an arbitrary object, does it make sense to apply something like murmurHash3's 'finalizer' to the result of Object.hashCode()?
Specifically, is the following hash function
long Mix(int i)
{
    long result = i;
    return result ^ (result << 32) ^ (result << 33); // Or some 'better' way of mixing up the bits of i.
}

long Hash(Object o)
{
    return Mix(o.hashCode());
}
better than simply doing
long Hash(Object o)
{
    return o.hashCode();
}
(I'm well aware that the second one gives you nothing over a 32-bit hash)
The hash is going to be used to implement (recursive) hash-join, and the buckets are going to be determined by doing hash % prime. A concern is that it's going to be hard to make a good sequence of independent hash functions for the 'recursive' part if we only have 32 bits to start out with.
I'm thinking the answer is 'no', and that you really need to start out with a 64-bit hash which was computed directly from the value of the object.
I guess a side question is whether you actually need a 64-bit hash in the first place for the purposes of hash-join.
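For reference, the kind of 64-bit finalizer the question alludes to is usually written along the following lines (the constants and shifts are those of MurmurHash3's public fmix64 routine; shown here as a C sketch rather than Java). Note that however well it mixes, feeding it a 32-bit hashCode still produces at most 2^32 distinct 64-bit values, so it spreads the bits around but adds no information.

#include <stdint.h>

// MurmurHash3-style 64-bit finalizer ("fmix64").
uint64_t mix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Applying it to a 32-bit hash: the result is well spread over 64 bits,
// but there are still only 2^32 possible outputs.
uint64_t hash64_from_hash32(uint32_t h)
{
    return mix64((uint64_t)h);
}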

Related

Seed for hash-table non cryptographic hash functions

If one sets the hash-table seed to a random number during resize or table creation, will that prevent DDoS attacks on such a hash table, or will an attacker who knows the hash algorithm still easily get around the seed? What if the algorithm uses the Pearson hash function with randomly generated tables unknown to the attacker? Does such a table-based hash still need a seed, or is it safe enough?
Context: I want to use an on-disk hash table for a key-value database for my toy web server, where the keys may depend on the user input.
There exist several approaches to protect your hash subsystem from an "adverse selection" attack; the most popular of them is called Universal Hashing, where the hash function or one of its parameters is randomly selected at initialization.
In my own approach, I use the same hash function, but each character is added to the result with non-linear mixing that depends on a random array of uint32_t[256]. The array is created during system initialization; in my code this happens at each start by reading /dev/urandom. See my implementation in the open-source emerSSL program. You're welcome to borrow the entire hash-table implementation, or just the hash function.
Currently, the hash function in the referenced source computes two independent hashes for the double-hashing search algorithm.
Here is a "reduced" hash function from the source, to demonstrate the idea of non-linear mixing with the S-block array:
#include <stdint.h>

uint32_t S_block[0x100]; // Substitution block, filled with random contents at startup

#define NLF(h, c) (S_block[(unsigned char)((c) + (h))] ^ (c))
#define ROL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

uint32_t hash(const char *key) {
    uint32_t h = 0x1F351F35; // Barker code * 2
    char c;
    for (int i = 0; (c = key[i]) != 0; i++) {
        h = ROL(h, 5);   // rotate the accumulator
        h += NLF(h, c);  // non-linear mixing through the random S-block
    }
    return h;
}
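For completeness, a minimal sketch of how S_block might be filled from /dev/urandom at startup (a hypothetical initializer, not taken from the emerSSL source):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

extern uint32_t S_block[0x100]; // the substitution block defined above

// Fill the substitution block with 256 random 32-bit words once at startup.
void init_S_block(void)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL || fread(S_block, sizeof(uint32_t), 0x100, f) != 0x100) {
        fprintf(stderr, "failed to seed S_block from /dev/urandom\n");
        exit(EXIT_FAILURE);
    }
    fclose(f);
}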

Verifying programs with heterogeneous arrays in VST

I'm verifying a C program that uses arrays to store heterogeneous data; in particular, the program uses arrays to implement cons cells, where the first element of the array is an integer value and the second element is a pointer to the next cons cell.
For example, the free operation for this list would be:
void listfree(void *x) {
    if (x == 0) {
        return;
    } else {
        void *n = *((void **)x + 1);
        listfree(n);
        free(x);
        return;
    }
}
Note: not shown here, but other parts of the code read values from the array and treat them as integers.
While I understand that the natural way to express this would be as some kind of struct, the program itself is written using an array, and I can't change this.
How should I specify the structure of the memory in VST?
I've defined an lseg predicate as follows:
Fixpoint lseg (x: val) (s: (list val)) (self_card: lseg_card) : mpred :=
  match self_card with
  | lseg_card_0 => !!(x = nullval) && !!(s = []) && emp
  | lseg_card_1 _alpha_513 =>
      EX v : Z,
      EX s1 : (list val),
      EX nxt : val,
      !!(~ (x = nullval)) &&
      !!(s = ([(Vint (Int.repr v))] ++ s1)) &&
      (data_at Tsh (tarray tint 2) [(Vint (Int.repr v)); nxt] x) *
      (lseg nxt s1 _alpha_513)
  end.
However, I run into trouble when trying to evaluate void *n = *(void **)x; presumably because the specification states that the memory contains an array of ints, not pointers.
The issue is probably the one described below, and it can almost be solved.
The C semantics permit casting an integer (of the right size) to a pointer, and vice versa, as long as you don't actually do any pointer operations on an integer value, or vice versa. Very likely your C program obeys those rules. But the type system of Verifiable C tries to enforce that local variables (and array elements, etc.) of integer type will never contain pointer values, and vice versa (except for the special integer value 0, which is NULL).
However, Verifiable C does support a (proved-foundationally-sound) workaround to this stricter enforcement:
typedef void * int_or_ptr
#ifdef COMPCERT
__attribute((aligned(_Alignof(void*))))
#endif
;
That is: the int_or_ptr type is void*, but with the attribute "align this as void*". So it's semantically identical to void*, but the redundant attribute is a hint to the VST type system to be less restrictive about C type enforcement.
So, when I say "can almost be solved", I'm asking: Can you modify the C program to use an array of "void* aligned as void*" ?
If so, then you can proceed. Your VST verification should use int_or_ptr_type, which is a definition of type Ctypes.type provided by VST-Floyd, when referring to the C-language type of these array elements, or of local variables that these elements are loaded into.
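For illustration, assuming the program is changed so that each cons cell is declared as an array of two int_or_ptr elements, the free operation from the question might be rewritten roughly as follows (a sketch under that assumption, not code taken from the VST distribution):

#include <stdlib.h>

typedef void * int_or_ptr
#ifdef COMPCERT
__attribute((aligned(_Alignof(void*))))
#endif
;

/* Each cons cell is an int_or_ptr[2]: element 0 holds the integer payload,
   element 1 holds the pointer to the next cell (or NULL). */
void listfree(int_or_ptr *x) {
    if (x == 0) {
        return;
    } else {
        int_or_ptr n = x[1];        /* load the "next" field */
        listfree((int_or_ptr *)n);  /* comparison with NULL and these casts are the legal operations */
        free(x);
        return;
    }
}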
Unfortunately, int_or_ptr_type is not documented in the reference manual (VC.pdf), which is an omission that should be corrected. You can look at progs/int_or_ptr.c and progs/verif_int_or_ptr.v, but these do much more than you want or need: they axiomatize operators that distinguish odd integers from aligned pointers, which is undefined in C11 (though consistent with C11; otherwise the OCaml garbage collector could never work). That is, those axiomatized external functions are consistent with CompCert, gcc, and clang; but you won't need any of them, because the only operations you're doing on int_or_ptr are the perfectly legal "comparison with NULL", "cast to integer", and "cast to struct foo *".

What is the point of writing integer in hexadecimal, octal and binary?

I am well aware that one is able to assign a value to a variable or constant in Swift and have that value represented in different formats.
For integers: one can write the literal in decimal, binary, octal, or hexadecimal.
For Float or Double: one can write the literal in either decimal or hexadecimal, and can make use of an exponent as well.
For instance:
var decInt = 17
var binInt = 0b10001
var octInt = 0o21
var hexInt = 0x11
All of the above variables hold the same value, which is 17.
But what's the catch? Why bother using the other formats rather than decimal?
Some notations can be much easier for people to understand, even though the resulting value is the same. Think, for example, of colour notation (hexadecimal) or file-permission notation (octal).
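For instance (a hypothetical C snippet; the same idea applies to Swift literals), the stored value is identical either way, but the base that matches the domain is far more readable:

#include <stdio.h>

int main(void)
{
    unsigned int red_mask  = 0xFF0000; /* 24-bit RGB colour: red channel only */
    unsigned int unix_mode = 0644;     /* octal file permissions: owner rw-, group r--, other r-- */
    printf("%u %u\n", red_mask, unix_mode); /* prints 16711680 420 -- much less meaningful in decimal */
    return 0;
}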
Code is best written in the most meaningful way.
Using the number format that best matches the domain of your program is just one example. You don't want to obscure domain-specific details, and you want to minimize the mental effort for the reader of your code.
Two other examples:
Do not simplify calculations. For example: to convert a scaled integer value in units of 1/10000 arc minute to a floating-point value in degrees, do not write the conversion factor as 600000.0, but instead write 10000.0 * 60.0 (see the sketch after these examples).
Choose a code structure that matches the nature of your data. For example: if you have a function with two return values, determine whether it's a symmetrical or an asymmetrical situation. For a symmetrical situation, always write a full if (condition) { return A; } else { return B; }. It's a common mistake to write if (condition) { return A; } return B; (simply because 'it works').
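To illustrate the first example (a sketch in C; scaled_value and its sample input are hypothetical):

#include <stdio.h>

int main(void)
{
    long scaled_value = 36000000;                      /* 60 degrees, scaled to 1/10000 arc minutes */
    double degrees = scaled_value / (10000.0 * 60.0);  /* the factor's origin stays visible */
    printf("%f\n", degrees);                           /* prints 60.000000 */
    return 0;
}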
Meaning matters!

How can I make a good hash function without unsigned integers?

I'm looking for a simple hash function that doesn't rely on integer overflow, and doesn't rely on unsigned integers.
The problem is that I have to create the hash function in Blueprint for Unreal Engine (which only has signed 32-bit integers, with undefined overflow behavior) and in PHP5, in a version that uses 64-bit signed integers.
So when I use the 'common' simple hash functions, they don't give the same result on both platforms, because they all rely on the overflow behavior of unsigned integers.
The only thing that is really important is that it has good 'randomness'. Does anyone know something simple that would accomplish this?
It's meant for a very basic signing system for sending messages to a server. It doesn't need to be top security... it's for storing high scores of a simple game on a server. The idea is that I would generate several hash integers from the message (using different 'start numbers') and append them to make a hash signature. I just need to make sure that if people sniff the network messages sent to the server, they cannot easily send faked messages. They would need to provide the correct hash signature with their message, which they shouldn't be able to do unless they know the hash function being used. Of course, if they reverse engineer the game they can still 'hack' it, but I wouldn't know how to counter that...
I have no access to existing hash functions in the Unreal Engine Blueprint system.
The first thing I would try would be to simulate the behavior of unsigned integers using signed integers, by explicitly applying the modulo operator whenever the accumulated hash-value gets large enough that it might risk overflowing.
Example code in C (apologies for the poor hash function, but the same technique should be applicable to any hash function, at least in principle):
#include <stdio.h>
#include <string.h>

int hashFunction(const char *buf, int numBytes)
{
    const int multiplier = 33;
    const int maxAllowedValue = 2147483647 - 255;  // == 2147483648 - 256, assuming 32-bit ints here
    const int maxPreMultValue = maxAllowedValue / multiplier;
    int hash = 536870912;  // arbitrary starting number
    for (int i = 0; i < numBytes; i++)
    {
        hash = hash % maxPreMultValue;        // make sure hash cannot overflow in the next operation!
        hash = (hash * multiplier) + buf[i];
    }
    return hash;
}

int main(int argc, char **argv)
{
    while (1)
    {
        printf("Enter a string to hash:\n");
        char buf[1024];
        if (fgets(buf, sizeof(buf), stdin) == NULL) break;  // stop on end of input
        printf("Hash code for that string is: %i\n", hashFunction(buf, (int)strlen(buf)));
    }
    return 0;
}

Specman: Why DAC macro interprets the type <some_name'exp> as 'string'?

I'm trying to write a DAC macro that gets as input the name of a list of bits and its size, and the name of an integer variable. Every element in the list should be constrained to be equal to the corresponding bit of the variable (both have the same length); i.e., for a list named list_of_bits and a variable named foo, both of length 4, the macro's output should be:
keep list_of_bits[0] == foo[0:0];
keep list_of_bits[1] == foo[1:1];
keep list_of_bits[2] == foo[2:2];
keep list_of_bits[3] == foo[3:3];
My macro's code is:
define <keep_all_bits'exp> "keep_all_bits <list_size'exp> <num'name> <list_name'name>" as computed {
    for i from 0 to (<list_size'exp> - 1) do {
        result = appendf("%s keep %s[%d] == %s[%d:%d];", result, <list_name'name>, i, <num'name>, i, i);
    };
};
The error I get:
*** Error: The type of '<list_size'exp>' is 'string', while expecting a
numeric type
...
for i from 0 to (<list_size'exp> - 1) do {
Why does it interpret <list_size'exp> as a string?
Thank you for your help.
All macro arguments in DAC macros are considered strings (except repetitions, which are considered lists of strings).
The point is that a macro treats its input purely syntactically, and it has no semantic information about the arguments. For example, in case of an expression (<exp>) the macro is unable to actually evaluate the expression and compute its value at compilation time, or even to figure out its type. This information is figured out at later compilation phases.
In your case, I would assume that the size is always a constant. So, first of all, you can use <num> instead of <exp> for that macro argument, and use as_a() to convert it to the actual number. The difference between <exp> and <num> is that <num> allows only constant numbers and not any expressions; but it's still treated as a string inside the macro.
Another important point: your macro itself should be a <struct_member> macro rather than an <exp> macro, because this construct itself is a struct member (namely, a constraint) and not an expression.
And one more thing: to ensure that the list size will be exactly as needed, add another constraint for the list size.
So, the improved macro can look like this:
define <keep_all_bits'struct_member> "keep_all_bits <list_size'num> <num'name> <list_name'name>" as computed {
    result = appendf("keep %s.size() == %s;", <list_name'name>, <list_size'num>);
    for i from 0 to (<list_size'num>.as_a(int) - 1) do {
        result = appendf("%s keep %s[%d] == %s[%d:%d];", result, <list_name'name>, i, <num'name>, i, i);
    };
};
Why not write it without a macro?
keep for each in list_of_bits {
    it == foo[index:index];
};
This should do the same, but it is more readable and easier to debug; the generation engine might also take advantage of the more concise constraint.