When is it appropriate to use a simple modulus as a hashing function?

I need to create a 16-bit hash from a 32-bit number, and I'm trying to determine whether a simple modulus 2^16 is appropriate.
The hash will be used in a 2^16-entry hash table for fast lookup of the 32-bit number.
My understanding is that if the data space has a fairly even distribution, a simple mod 2^16 is fine: it shouldn't result in too many collisions.
In this case, my 32-bit number is the result of a modified Adler-32 checksum, using 2^16 as M.
So, in a general sense, is my understanding correct that it's fine to use a simple mod n (where n is the hash table size) as a hashing function if I have an even data distribution?
And specifically, will Adler-32 give a random enough distribution for this?

Yes, if your 32-bit numbers are uniformly distributed over all possible values, then those numbers modulo n will also be uniformly distributed over the n possible values (exactly so when n divides 2^32, as 2^16 does, and very nearly so otherwise).
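A minimal sketch of why this holds for n = 2^16, written in Python purely for illustration: reducing modulo 2^16 keeps the low 16 bits, so any full block of consecutive 32-bit values lands in every bucket equally often. The names `bucket` and `TABLE_SIZE` are hypothetical and not taken from the question.

```python
from collections import Counter

TABLE_SIZE = 1 << 16  # 2^16 buckets, as in the question


def bucket(value: int) -> int:
    """Map a 32-bit value to a table index with a plain modulus."""
    return value % TABLE_SIZE


# For a power-of-two table size the modulus is just the low 16 bits.
assert all(bucket(v) == (v & 0xFFFF) for v in (0, 1, 65535, 65536, 0xDEADBEEF))

# Walk 4 * 2^16 consecutive values and count how often each bucket is hit.
counts = Counter(bucket(v) for v in range(4 * TABLE_SIZE))
print(len(counts), min(counts.values()), max(counts.values()))
```

The counts come out equal only because the inputs cover the value range evenly; the modulus does nothing to repair a skewed input.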
Whether the results of your modified checksum algorithm are uniformly distributed is an entirely different question. That will depend on whether the input you are applying the algorithm to is long enough to roll over the sums several times. If you are applying the algorithm to short strings that don't roll over the sums, then the result will not be uniformly distributed.
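To make the short-string problem concrete, here is a hedged sketch in Python. The question does not show its modified checksum, so `adler32_variant` is only an assumption of what "Adler-32 with 2^16 as M" might look like (the standard algorithm uses 65521); `bucket` is likewise a hypothetical helper.

```python
def adler32_variant(data: bytes, modulus: int = 1 << 16) -> int:
    """Adler-32-style checksum; `modulus` stands in for the question's M."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % modulus
        b = (b + a) % modulus
    return (b << 16) | a


def bucket(data: bytes) -> int:
    """Table index: the checksum reduced modulo 2^16 (its low half, `a`)."""
    return adler32_variant(data) % (1 << 16)


# With 8-byte keys, `a` is at most 1 + 8 * 255 = 2041, so at most 2041 of
# the 65536 buckets can ever be used, however many keys are inserted.
keys = [bytes([i % 256] * 8) for i in range(256)]
largest = max(bucket(k) for k in keys)
print(largest)
```

Under this assumed variant the low half is only a byte sum, so keys that are permutations of one another, such as `b"ab"` and `b"ba"`, also land in the same bucket.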
If you want a hash function, then you should use a hash function. Neither Adler-32 nor any CRC is a good hash function. There are many very fast and effective hash functions available in the public domain. You can look at CityHash.

Related

How can a function to find the Hamming distance be accelerated for larger data sets in PostgreSQL?

I have a PostgreSQL database with more than 10,0000 entries, and each entry has a bit array of size 10000. Is there any method to accelerate the Hamming distance calculation over the bit arrays for the whole table? Thanks.
I tried using different data types such as bytea, text and numeric for storing the bit array; to calculate the Hamming distance I tried XOR operations, text comparison and numeric addition respectively for each data type. But I could not optimize the function enough: it currently takes almost 2 seconds per operation, and the target is 200 milliseconds.
It is not possible to get good performance for the Hamming distance, because it is a recursive process with high algorithmic complexity and a very high memory footprint.
https://www.cs.swarthmore.edu/~brody/papers/random14-hamming-distance.pdf
It is not suitable for big datasets such as those in an RDBMS.
Other comparison techniques exist that have lower complexity, no recursive process and a minimal footprint. They are not as accurate as the Hamming distance, but can do a good job, such as the one I wrote:
See "inférence basique"
You can combine the two: first use inférence basique to reduce the set, then use the Hamming distance on the very few remaining results.
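For reference, the Hamming step on two equal-length bit arrays amounts to an XOR followed by a count of the set bits. The sketch below is in Python rather than PostgreSQL and says nothing about how fast a database-side version would be; `hamming_distance` is a hypothetical helper name, and the arrays are assumed to be stored as bytes.

```python
def hamming_distance(x: bytes, y: bytes) -> int:
    """Number of differing bits between two equal-length bit arrays."""
    if len(x) != len(y):
        raise ValueError("bit arrays must have the same length")
    # XOR leaves a 1 wherever the inputs differ; count those 1 bits.
    diff = int.from_bytes(x, "big") ^ int.from_bytes(y, "big")
    return bin(diff).count("1")


# 10000 bits = 1250 bytes, matching the array size in the question.
a = bytes(1250)                  # all zeros
b = bytes([0b00001111]) * 1250   # four set bits per byte
print(hamming_distance(a, b))
```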

Strange rand() behaviour in MATLAB

rand() does not seem to generate truly random numbers. I have a simple program that returns a 6-digit number by calling:
for i=1:6
r=rand(1,1)
end
I ran this 4-5 times yesterday and saved the output. Today I opened MATLAB again and called the same function another 4-5 times. The same numbers were returned.
Why is this happening?
Should I provide a random seed or any other fix?
Thanks for any help!
To expand on @alexforrence's answer, rand and other related functions produce pseudo-random numbers (PRNs) that require an initial value to begin production. These numbers are not truly random since, following the initial seed, the numbers are produced via an algorithm, which is deterministic by its very nature.
However, being pseudo-random isn't necessarily a bad thing since models that use PRNs (e.g., Monte Carlo Methods) can generate portable, repeatable results across many users and platforms.
Additionally, the seed can be changed to create sets of random numbers that are statistically independent of one another, while each set remains repeatable.
For many scientific applications, this is very important.
Also, "true" random numbers (next paragraph) have a tendency to "clump" together rather than spread evenly over their range when only a small sample of the space is drawn, which will degrade the performance of some methods that rely on stochastic processes.
There are methods to create "true-er" random numbers by the introduction of randomness from various analogue sources (e.g., hardware noise). These types of numbers are extremely important for cryptographically secure PRNs, where non-repeatability is an important feature (in contrast to the scientific usage). True random number generators require special hardware that leverages natural noise (e.g., quantum effects).
It is important to remember, though, that the total number of distinct random numbers that can be generated and used computationally is limited by the precision of the numbers being used.
You can re-seed MATLAB with a pseudo-random seed using the rng function.
However, "reseeding the generator too frequently within a session is not a good idea because the statistical properties of your random numbers can be adversely affected" [src].
From the Mathworks documentation, you can use
rng('shuffle');
before calling rand to set a "random" seed (based on the current time). Using a fixed seed (either by not changing the seed at startup, by resetting it using rng('default'), or by setting it manually with rng(number)) allows you to exactly repeat previous behavior.
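The same behaviour can be reproduced outside MATLAB. The sketch below uses Python's standard `random` module purely as an analogy to rng(number) and rng('shuffle'); it is not MATLAB code, `six_draws` is a hypothetical helper, and the seed 12345 is arbitrary.

```python
import random


def six_draws(seed=None):
    """Return six uniform draws from a generator seeded with `seed`."""
    gen = random.Random(seed)  # seed=None seeds from OS entropy or the clock
    return [gen.random() for _ in range(6)]


fixed_a = six_draws(12345)
fixed_b = six_draws(12345)
print(fixed_a == fixed_b)   # same seed, same six numbers

fresh = six_draws()         # unpredictable seed, so a new sequence each run
print(fresh == fixed_a)
```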

Correct way to generate random numbers

On page 3 of "Lecture 8, White Noise and Power Spectral Density" it is mentioned that rand and randn create pseudo-random numbers. Please correct me if I am wrong: a sequence of random numbers is one for which two sequences generated from the same seed are never exactly the same.
Pseudo-random numbers, by contrast, are deterministic, i.e., two sequences are the same if generated from the same seed.
How can I create random numbers rather than pseudo-random numbers? I was under the impression that Matlab's rand and randn functions are used to generate independent, identically distributed random numbers, but the slides mention that they create pseudo-random numbers. Googling for how to create random numbers returns the rand and randn functions.
The reason for distinguishing random numbers from pseudo-random numbers is that I need to compare the performance of cryptography using (A) a random signal with white-noise characteristics and (B) a pseudo-random signal with white-noise characteristics. So (A) must be different from (B). I shall be grateful for any code and the correct way to generate random numbers and pseudo-random numbers.
Generation of "true" random numbers is a tricky exercise; you can check Wikipedia on RNG and the tests of randomness (http://en.wikipedia.org/wiki/Random_number_generation). This link offers an RNG based on atmospheric noise (http://www.random.org/).
As mentioned above, it is really difficult (probably impossible) to create real random numbers with computer software. There are numerous projects on the internet that provide real random numbers generated by physical processes (for example the one Kostya mentioned). A particularly interesting one is this one from HU Berlin.
That being said, for experiments like the one you want to perform, MATLAB's pseudo RNGs are more than fine. MATLAB's algorithms include the Mersenne Twister, which is one of the best-known pseudo RNGs (I would suggest you look up the Mersenne Twister's properties). See the MATLAB rng documentation.
Since you did not mention which type of system you want to simulate, one simple approach to solve your issue would be to use a good RNG (Mersenne Twister) for process A and a not-so-good one for process B.
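As a rough sketch of how the two kinds of source could be set up side by side, here is a Python version (not MATLAB) using only the standard library. It assumes that draws backed by the operating system's entropy pool are an acceptable stand-in for (A); strictly speaking that pool feeds a cryptographic generator rather than a purely physical one, so a service such as random.org would be needed for physically generated numbers. The length `N` and the seed 2024 are arbitrary.

```python
import random

N = 1000

# (A) Draws backed by the operating system's entropy source; not seedable.
os_gen = random.SystemRandom()
signal_a = [os_gen.gauss(0.0, 1.0) for _ in range(N)]

# (B) Mersenne Twister with a fixed, arbitrary seed; fully reproducible.
prng = random.Random(2024)
signal_b = [prng.gauss(0.0, 1.0) for _ in range(N)]

# Rebuilding (B) from the same seed gives the identical signal.
prng_again = random.Random(2024)
print(signal_b == [prng_again.gauss(0.0, 1.0) for _ in range(N)])
```

Both lists are Gaussian white-noise samples; the difference is only that (B) can be regenerated exactly from its seed and (A) cannot.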

Matlab `corr` gives different results on the same dataset. Is floating-point calculation deterministic?

I am using Matlab's corr function to calculate the correlation of a dataset. While the results agree to within double-precision accuracy (<10^-14), they are not exactly the same across different runs, even on the same computer.
Is floating-point calculation deterministic? Where is the source of the randomness?
Yes and no.
Floating point arithmetic, as in a sequence of operations +, *, etc. is deterministic. However in this case, linear algebra libraries (BLAS, LAPACK, etc) are most likely being used, which may not be: for example, matrix multiplication is typically not performed as a "triple loop" as some references would have you believe, but instead matrices are split up into blocks that are optimised for maximum performance based on things like cache size. Therefore, you will get different sequences of operations, with different intermediate rounding, which will give slightly different results. Typically, however, the variation in these results is smaller than the total rounding error you are incurring.
I have to admit, I am a little bit surprised that you get different results on the same computer, but it is difficult to know why without knowing what the library is doing (IIRC, Matlab uses the Intel BLAS libraries, so you could look at their documentation).
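The effect of reordering can be seen with just three numbers. This is a minimal Python illustration of floating-point addition not being associative; it does not reproduce what corr or any BLAS library actually does internally.

```python
values = [0.1, 0.2, 0.3]

# Left-to-right order, as in a naive loop.
left_to_right = (values[0] + values[1]) + values[2]

# A different grouping, as a blocked algorithm might choose.
regrouped = values[0] + (values[1] + values[2])

print(left_to_right == regrouped)
print(abs(left_to_right - regrouped))
```

Each grouping is perfectly deterministic on its own; the results differ only because the intermediate roundings differ.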

weighted distribution hash bits

I would like to ask whether there is a weight distribution equation for a hash function.
For example, in channel coding theory there is a weight enumerator equation for Reed-Solomon codes, which gives you the number of words of weight i.
Thanks
If you mean a cryptographic hash function, then certainly not. Ideally, the output of a cryptographic hash function can take any value, so every word of the given length is possible under a cryptographic hash function.
Reed-Solomon codes are linear codes, for which the minimum weight of a nonzero codeword equals the minimum distance of the code; such a code is in no way similar to a hash function.