I'm integrating a hash function (FarmHash) into our codebase. The hashing itself seems to work correctly: it turns a string of characters into a unique-ish integer value.
I've added infrastructure to detect collisions (cases where two different input strings produce the same output integer). For each string that is hashed, I keep a [hash result] -> [string] entry in a map, and every time a new string is hashed I look it up in the map; if the hash is already there, I make sure that the same string generated it. I am aware that this is potentially slow and memory-hungry, but I perform these checks only on request: they are not enabled in release mode.
Now I'd like to test that infrastructure (that is, provoke a collision from a unit test).
I could generate a bunch of strings (random or sequential), feed them to my hash infrastructure and hope to see a collision, but I feel I would waste my time and CPU cycles and fill the memory with a load of data without success.
How would one go about generating collisions?
Not-so-relevant-facts:
I'm using C++;
I can generate data using Python;
The target int is uint32_t.
Update:
I have written a small, naive program to brute-force a collision:
void
addToQueue(std::string&& aString)
{
//std::cout << aString << std::endl;
hashAndCheck( aString ); // Performs the hash and checks whether there is a collision
if ( mCount % 1000000 == 0 ) // report progress once every million checks
std::cout << "Did " << mCount << " checks so far" << std::endl;
mQueue.emplace( aString );
}
void
generateNextRound( const std::string& aBase )
{
// ASCII codes 48 ('0') to 122 ('z'), inclusive
for ( int i = 48; i <= 122; i++ )
{
addToQueue( std::move( std::string( aBase ).append( 1, static_cast<char>( i ) ) ) );
}
}
int main( void )
{
// These two generate a collision
//StringId id2 = HASH_SID( "#EF" ); // Hashes only, does not check
//StringId id1 = HASH_SID( "7\\:" ); // Hashes only, does not check
std::string base = "";
addToQueue( std::move( base ) );
while ( true )
{
const std::string val = mQueue.front();
mQueue.pop();
generateNextRound( val );
}
return 0;
}
I could eventually have added threading and stuff in there but I didn't need it because I found a collision in about 1 second (in debug mode).
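For a sense of why the brute-force search succeeds so quickly, the birthday approximation can be sketched as below. It assumes the hash behaves like a uniform random function over 2^32 outputs, which is an idealization and not a measured property of FarmHash; collisionProbability is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `n` distinct inputs share a
// hash value, ASSUMING the hash is uniformly distributed over `buckets`
// possible outputs (birthday approximation: 1 - exp(-n(n-1) / (2 * buckets))).
double collisionProbability(double n, double buckets)
{
    assert(n >= 0.0 && buckets > 0.0);
    return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * buckets));
}
```

Under that assumption, collisionProbability(77163.0, 4294967296.0) comes out at roughly 0.5, so hashing a few hundred thousand short strings makes a collision in a uint32_t very likely.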
If you brute-force the search for collisions offline, you can hard-code the colliding strings into your test. That keeps the test as close to production code as possible without paying for the brute-force search on every run. (Or, as other people have said, you can substitute an intentionally junky hash algorithm that causes excessive collisions.)
You could limit the range of the integer output by the hash function; in general you should be able to pass it some number n so that results lie between 0 and n-1. If you limit it to, say, 10, then you will definitely end up with collisions as soon as you hash more than 10 distinct strings.
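A minimal sketch of the range-limiting idea follows. It uses std::hash as a stand-in for the real hash function and reduces the result modulo n; reducedHash and findCollision are hypothetical names, not part of FarmHash or of the asker's code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Reduce a hash to the range [0, n). std::hash is only a stand-in here.
std::uint32_t reducedHash(const std::string& s, std::uint32_t n)
{
    assert(n > 0);
    return static_cast<std::uint32_t>(std::hash<std::string>{}(s) % n);
}

// Returns the first pair of different strings that share a reduced hash,
// or a pair of empty strings if the inputs contain no collision.
std::pair<std::string, std::string>
findCollision(const std::vector<std::string>& inputs, std::uint32_t n)
{
    std::unordered_map<std::uint32_t, std::string> seen;
    for (const std::string& s : inputs)
    {
        auto result = seen.emplace(reducedHash(s, n), s);
        if (!result.second && result.first->second != s)
            return { result.first->second, s };
    }
    return {};
}
```

With n = 10 and any 11 distinct strings, the pigeonhole principle guarantees that findCollision returns a colliding pair, whatever the underlying hash is.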
For key k and hash function h, return a constant c:
h(k) = c
This always collides, regardless of what key you use.
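As a sketch of how a constant hash could drive a unit test of the detection code: the checker below takes the hash function as a parameter, so the test can swap in h(k) = c. CollisionChecker and constantHash are hypothetical names that only mirror the map-based check described in the question; the real infrastructure may look different.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

class CollisionChecker
{
public:
    using HashFn = std::function<std::uint32_t(const std::string&)>;

    explicit CollisionChecker(HashFn fn) : mHash(std::move(fn))
    {
        assert(static_cast<bool>(mHash)); // a callable hash is required
    }

    // Hashes `s`, records hash -> string, and returns true when a
    // different string has already produced the same hash.
    bool hashAndCheck(const std::string& s)
    {
        auto result = mSeen.emplace(mHash(s), s);
        return !result.second && result.first->second != s;
    }

private:
    HashFn mHash;
    std::unordered_map<std::uint32_t, std::string> mSeen;
};

// h(k) = c: every key maps to the same value (42 is arbitrary).
inline std::uint32_t constantHash(const std::string&) { return 42u; }
```

With constantHash injected, the first string is recorded without complaint, hashing the same string again is not a collision, and any second distinct string is reported.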
Related
If the hash table seed is set to a random number at table creation or on resize, will that prevent denial-of-service (hash-flooding) attacks on the table, or can an attacker who knows the hash algorithm still easily get around the seed? What if the algorithm uses the Pearson hash function with randomly generated tables that are unknown to the attacker? Does such a table-based hash still need a seed, or is it safe enough?
Context: I want to use an on-disk hash table for a key-value database for my toy web server, where the keys may depend on the user input.
There are several approaches to protecting a hash subsystem from an adversarial key-selection attack. The most popular is universal hashing, where the hash function, or one of its parameters, is selected at random at initialization.
In my own approach, I use a single hash function in which each character is added to the result through a non-linear mixing step that depends on a random uint32_t[256] array. The array is created during system initialization; in my code this happens at each start, by reading /dev/urandom. See my implementation in the open-source emerSSL program. You are welcome to borrow the entire hash-table implementation, or only the hash function.
Currently, the hash function in the referenced source computes two independent hashes for a double-hashing search algorithm.
Here is a reduced form of the hash function from that source, to demonstrate the idea of non-linear mixing with an S-box array:
#include <stdint.h>
uint32_t S_block[0x100]; // Substitution block, filled with random values at startup
#define NLF(h, c) (S_block[(unsigned char)((c) + (h))] ^ (c)) // non-linear mixing step
#define ROL(x, n) (((x) << (n)) | ((x) >> (32 - (n)))) // rotate left by n bits
uint32_t hash(const char *key) { // returns the full unsigned 32-bit state
uint32_t h = 0x1F351F35; // Barker code * 2
char c;
for(int i = 0; (c = key[i]) != 0; i++) {
h = ROL(h, 5);
h += NLF(h, c);
}
return h;
}
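The random initialization of the table is not shown above. A minimal sketch of it might look like this, assuming std::random_device is an acceptable entropy source (its quality is implementation-defined, and the original reads /dev/urandom instead). makeRandomSBox and seededHash are hypothetical names, and unlike the macro version this sketch treats every byte as unsigned, so results differ for bytes above 0x7F on platforms where char is signed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <random>

using SBox = std::array<std::uint32_t, 256>;

// Fill the substitution table once per process start.
SBox makeRandomSBox()
{
    std::random_device rd; // entropy quality is implementation-defined
    SBox box;
    for (auto& word : box)
        word = static_cast<std::uint32_t>(rd());
    return box;
}

inline std::uint32_t rotl5(std::uint32_t x) { return (x << 5) | (x >> 27); }

// Same structure as the reduced hash above, with the table passed in
// explicitly instead of living in a global.
std::uint32_t seededHash(const SBox& box, const char* key)
{
    assert(key != nullptr);
    std::uint32_t h = 0x1F351F35u;
    for (; *key != '\0'; ++key)
    {
        const std::uint32_t c = static_cast<unsigned char>(*key);
        h = rotl5(h);
        h += box[(c + h) & 0xFFu] ^ c;
    }
    return h;
}
```

Within one process the table is fixed, so equal keys always hash equally; across restarts the table changes, so an attacker cannot precompute colliding keys without learning the table contents.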
I'm looking for a simple hash function that doesn't rely on integer overflow, and doesn't rely on unsigned integers.
The problem is that I have to create the hash function in blueprint from Unreal Engine (only has signed 32 bit integer, with undefined overflow behavior) and in PHP5, with a version that uses 64 bit signed integers.
So when I use the 'common' simple hash functions, they don't give the same result on both platforms because they all rely on bit-overflowing behavior of unsigned integers.
The only thing that is really important is that it has good 'randomness'. Does anyone know something simple that would accomplish this?
It's meant for a very basic signing system for sending messages to a server. It doesn't need to be top security: it's for storing high scores of a simple game on a server. The idea is that I would generate several hash integers from the message (using different 'start numbers') and append them to make a hash signature. I just need to make sure that people who sniff the network messages sent to the server cannot easily send faked messages. They would need to provide the correct hash signature with their message, which they shouldn't be able to do unless they know the hash function being used. Of course, if they reverse-engineer the game they can still 'hack' it, but I wouldn't know how to counter that...
I have no access to existing hash functions in the unreal engine blueprint system.
The first thing I would try would be to simulate the behavior of unsigned integers using signed integers, by explicitly applying the modulo operator whenever the accumulated hash-value gets large enough that it might risk overflowing.
Example code in C (apologies for the poor hash function, but the same technique should be applicable to any hash function, at least in principle):
#include <stdio.h>
#include <string.h>
int hashFunction(const char * buf, int numBytes)
{
const int multiplier = 33;
const int maxAllowedValue = 2147483648-256; // assuming 32-bit ints here
const int maxPreMultValue = maxAllowedValue/multiplier;
int hash = 536870912; // arbitrary starting number
for (int i=0; i<numBytes; i++)
{
hash = hash % maxPreMultValue; // make sure hash cannot overflow in the next operation!
hash = (hash*multiplier)+(unsigned char)buf[i]; // treat bytes as 0..255 so the hash never goes negative
}
return hash;
}
int main(int argc, char ** argv)
{
char buf[1024];
while(1)
{
printf("Enter a string to hash:\n");
if (fgets(buf, sizeof(buf), stdin) == NULL) break; // stop on EOF or read error
printf("Hash code for that string is: %i\n", hashFunction(buf, (int)strlen(buf)));
}
return 0;
}
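The question's idea of hashing the message with several different start numbers and appending the results could be sketched with the same overflow-avoiding technique. This is written in C++ with only non-negative signed 32-bit values; signedHash and makeSignature are hypothetical names, and the multiplier and start values are arbitrary choices, not a recommendation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Multiplicative hash that never leaves the range [0, INT32_MAX]:
// the running value is reduced before each multiplication.
std::int32_t signedHash(const std::string& data, std::int32_t start)
{
    assert(start >= 0);
    const std::int32_t multiplier = 33;
    const std::int32_t maxAllowedValue = 2147483647 - 255; // room for one byte
    const std::int32_t maxPreMultValue = maxAllowedValue / multiplier;
    std::int32_t hash = start;
    for (char ch : data)
    {
        hash %= maxPreMultValue; // keeps hash * multiplier below the limit
        hash = hash * multiplier + static_cast<unsigned char>(ch);
    }
    return hash;
}

// One hash per start number; the caller can append them to the message.
std::vector<std::int32_t> makeSignature(const std::string& message,
                                        const std::vector<std::int32_t>& starts)
{
    std::vector<std::int32_t> signature;
    for (std::int32_t start : starts)
        signature.push_back(signedHash(message, start));
    return signature;
}
```

Because every intermediate value stays between 0 and 2147483647, the same arithmetic should give identical results on a platform with 32-bit signed integers and on one with 64-bit signed integers, provided both treat the input bytes as values from 0 to 255.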
For example, if you're programming in Java, and you want to create a 64-bit hash function for an arbitrary object, does it make sense to apply something like murmurHash3's 'finalizer' to the result of Object.hashCode()?
Specifically, is the following hash function
long Mix(int i)
{
long result = i;
return result ^ (result << 32) ^ (result << 33); // Or some 'better' way of mixing up the bits of i.
}
long Hash(Object o)
{
return Mix(o.hashCode());
}
better than simply doing
long Hash(Object o)
{
return o.hashCode();
}
(I'm well aware that the second one gives you nothing over a 32-bit hash)
The hash is going to be used to implement a (recursive) hash join, and the buckets are going to be determined by computing hash % prime. A concern is that it will be hard to make a good sequence of independent hash functions for the 'recursive' part if we only have 32 bits to start out with.
I'm thinking the answer is 'no', and that you really need to start out with a 64-bit hash which was computed directly from the value of the object.
I guess a side question is whether you actually need a 64-bit hash in the first place for the purposes of hash-join.
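To make the concern concrete, here is a C++ sketch of the approach being asked about: widening a 32-bit hash code and passing it through the 64-bit finalizer from MurmurHash3 (fmix64). The names widen and bucketOf are hypothetical. The finalizer is a bijection on 64-bit values, so it spreads bits without adding information.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit finalizer from MurmurHash3 (fmix64). Each step is invertible,
// so distinct inputs always give distinct outputs.
std::uint64_t fmix64(std::uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Widen a 32-bit hash code (as Object.hashCode() would return) to 64 bits.
std::uint64_t widen(std::int32_t hashCode)
{
    return fmix64(static_cast<std::uint32_t>(hashCode));
}

// Bucket selection as described in the question: hash % prime.
std::uint64_t bucketOf(std::uint64_t hash64, std::uint64_t prime)
{
    assert(prime > 0);
    return hash64 % prime;
}
```

Two objects whose 32-bit hash codes are equal still get equal 64-bit values, and therefore the same bucket at every level of the recursion, so the result can take at most 2^32 distinct values. That is consistent with the suspicion that the 64-bit hash has to be computed from the object's value directly.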
The following is very slow for long strings:
std::string s = "long string";
K klist = DBVec::CreateList(KG , s.length());
for (int i=0; i<s.length(); i++)
{
kG(klist)[i]=s.c_str()[i];
}
It works acceptably fast (<100ms) for strings up to 100k, but slows to a crawl (tens of minutes, possibly hours) for strings of a few million characters. I don't see anything other than kG that can create nonlinearity. I don't see any reason for accessor function kG to be non-constant time, but there is just nothing else in this loop. Unfortunately I don't know how kG works due to lack of documentation.
Question: given a blob of binary data as std::string, what's the efficient way to construct a byte list?
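kG is a macro defined in k.h which expands to ((x)->G0), i.e. it follows the G0 pointer of the K object, which is a plain pointer dereference.
http://kx.com/q/d/a/c.htm#Strings documents kp, which creates a K string object directly from a C string, so presumably you could write K klist = kp(s.c_str()), which is probably faster. Since kp takes a NUL-terminated string, it would stop at the first zero byte in binary data.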
This works:
memcpy(kG(klist), s.c_str(), s.length());
I still wonder why that loop is not O(N).
Can anyone give an example of two strings, consisting of alphabetical characters only, that produce the same hash value with ELFHash?
I need these to test my code, but they don't seem easy to produce. To my surprise, there is a lot of example code for various hash functions on the internet, but none of it provides examples of colliding strings.
Below is the ELF hash, in case you need it.
unsigned int ELFHash(const std::string& str)
{
unsigned int hash = 0;
unsigned int x = 0;
for(std::size_t i = 0; i < str.length(); i++)
{
hash = (hash << 4) + str[i];
if((x = hash & 0xF0000000L) != 0)
{
hash ^= (x >> 24);
hash &= ~x;
}
}
return (hash & 0x7FFFFFFF);
}
You can find collisions by brute force (e.g. compute the hash of every possible string of length up to 5).
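A brute-force search of this kind might look like the following sketch. ELFHash is copied from the question; findCollisions is a hypothetical helper that enumerates every string of one fixed length over a given alphabet and remembers the first string seen for each hash value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

unsigned int ELFHash(const std::string& str)
{
    unsigned int hash = 0;
    unsigned int x = 0;
    for (std::size_t i = 0; i < str.length(); i++)
    {
        hash = (hash << 4) + str[i];
        if ((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    return (hash & 0x7FFFFFFF);
}

// Returns (first string seen, later string) for every later string whose
// hash was already produced by an earlier one.
std::vector<std::pair<std::string, std::string>>
findCollisions(const std::string& alphabet, std::size_t length)
{
    assert(!alphabet.empty());
    std::vector<std::pair<std::string, std::string>> collisions;
    std::unordered_map<unsigned int, std::string> firstSeen;
    std::vector<std::size_t> idx(length, 0);  // one "odometer" digit per position
    std::string current(length, alphabet[0]);
    while (true)
    {
        auto result = firstSeen.emplace(ELFHash(current), current);
        if (!result.second)
            collisions.emplace_back(result.first->second, current);
        // Advance to the next string, rightmost position first.
        std::size_t pos = length;
        while (pos > 0)
        {
            --pos;
            if (++idx[pos] < alphabet.size())
            {
                current[pos] = alphabet[idx[pos]];
                break;
            }
            idx[pos] = 0;
            current[pos] = alphabet[0];
            if (pos == 0)
                return collisions;  // every string has been visited
        }
        if (length == 0)
            return collisions;
    }
}
```

For instance, findCollisions("ABQ", 2) reports the pair "AQ" and "BA": for two-character strings the hash is simply 16 * first + second, and both strings give 1121. Running it over the full alphabet with longer lengths costs memory proportional to the number of strings visited.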
Some examples of collisions (which I found that way):
hash = 23114:
-------------
UMz
SpJ
hash = 4543841:
---------------
AAAAQ
AAABA
hash = 5301994:
---------------
KYQYZ
KYQZJ
KYRIZ
KYRJJ
KZAYZ