weighted distribution hash bits - hash

I would like to ask whether there is a weight distribution equation for hash functions.
For example, in channel coding theory there is a weight enumerator equation for Reed-Solomon codes, which gives the number of codewords of weight i.
Thanks

If you mean a cryptographic hash function, then certainly not. Ideally the output of a cryptographic hash function can take any value, so every word of the given length is a possible output.
Reed-Solomon codes are linear codes, in which the minimum weight of a nonzero codeword equals the minimum distance of the code. That structure is in no way similar to a hash function.
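As a rough illustration of the point above: if every n-bit word is an equally likely output, the number of words of weight i is simply the binomial coefficient C(n, i), with mean weight n/2. The sketch below assumes SHA-256 as the hash (the question does not name one) and uses arbitrary inputs b"0", b"1", ...; the helper names are hypothetical.

```python
import hashlib
from collections import Counter

def hamming_weight(data: bytes) -> int:
    """Count the 1 bits in a byte string."""
    return sum(bin(b).count("1") for b in data)

def weight_histogram(num_inputs: int) -> Counter:
    """Histogram of SHA-256 digest weights over the inputs b"0", b"1", ..."""
    return Counter(
        hamming_weight(hashlib.sha256(str(i).encode()).digest())
        for i in range(num_inputs)
    )

hist = weight_histogram(20000)
mean = sum(w * c for w, c in hist.items()) / sum(hist.values())
# A uniformly random 256-bit word has mean weight 128.
print(f"mean digest weight: {mean:.2f}")
```

The histogram should look like a binomial distribution centred near 128, which is what you would expect from random words rather than from a code with a designed minimum distance.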

Related

What is the difference between hash encoding and vector embeddings?

I read:
Hashing is the transformation of an arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e., to generate the hash value of an input.
Vector embeddings seem to do much the same thing: they convert an input into a vector of fixed dimension. I am trying to understand the difference between them.
Hash encoding can use any function that maps a string to a fixed-size, random-looking number (ideally with few collisions, although uniqueness is not guaranteed), whereas when creating vector embeddings we try to use domain knowledge and the context in which the string occurs in the corpus.
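A minimal sketch of the contrast, under stated assumptions: the corpus is a hypothetical three-sentence toy, the "embedding" is just a same-sentence co-occurrence count (real embeddings are learned differently), and all function names are made up for illustration.

```python
import hashlib
import math
from collections import Counter

def hash_bucket(token: str, num_buckets: int = 1024) -> int:
    """Hash encoding: a fixed function of the string, no corpus needed."""
    digest = hashlib.sha256(token.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Hypothetical toy corpus, for illustration only.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "stocks fell on the market",
]

def cooccurrence_vector(token: str) -> Counter:
    """Toy embedding: counts of the words sharing a sentence with token."""
    counts = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        if token in words:
            words.remove(token)  # drop one occurrence of the token itself
            counts.update(words)
    return counts

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

cat, dog, stocks = (cooccurrence_vector(w) for w in ("cat", "dog", "stocks"))
print(hash_bucket("cat"), hash_bucket("dog"))   # bucket ids carry no meaning
print(cosine(cat, dog), cosine(cat, stocks))    # context-based similarity
```

The bucket ids say nothing about how "cat" and "dog" relate, while the context vectors place "cat" closer to "dog" than to "stocks" because they appear in similar sentences.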

Encoding a probability distribution for a genetic algorithm

What are some simple and efficient ways to encode a probability distribution as a chromosome for a genetic/evolutionary algorithm?
It depends heavily on the nature of the probability distribution you have in hand. As you know, a probability distribution is a mathematical function, so the properties of this function govern how the distribution is represented as a chromosome. For example, do you have a discrete probability distribution (which is encoded by a discrete list of the probabilities of the outcomes, as when tossing a coin) or a continuous probability distribution (which applies when the set of possible outcomes takes values in a continuous range, like the temperature on a given day)?
As a simple instance, suppose you want to encode the Normal distribution, which is an important distribution in probability theory. This distribution can be encoded as a two-dimensional chromosome in which the first dimension is the mean (Mu) and the second is the variance (Sigma^2). You can then calculate the probability using these two parameters. For other continuous probability distributions, such as the Cauchy distribution, you can proceed in a similar way.
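The two-gene encoding described above could be sketched as follows. The Gaussian mutation operator, its step size, and the clamp that keeps the variance positive are assumptions added for illustration; the answer does not prescribe any particular operator.

```python
import math
import random
from typing import Tuple

Chromosome = Tuple[float, float]  # (mean, variance)

def normal_pdf(x: float, chromosome: Chromosome) -> float:
    """Density of the Normal distribution encoded as (mean, variance)."""
    mu, sigma2 = chromosome
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def mutate(chromosome: Chromosome, rng: random.Random, step: float = 0.1) -> Chromosome:
    """Gaussian mutation of both genes; the variance gene is kept strictly positive."""
    mu, sigma2 = chromosome
    new_mu = mu + rng.gauss(0.0, step)
    new_sigma2 = max(1e-9, sigma2 + rng.gauss(0.0, step))
    return (new_mu, new_sigma2)

rng = random.Random(0)
parent = (0.0, 1.0)          # standard Normal: mean 0, variance 1
child = mutate(parent, rng)
print(normal_pdf(0.0, parent), child)
```

For a discrete distribution the chromosome would instead be the list of outcome probabilities, and any operator would need to renormalise it so the genes still sum to 1.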

Is MATLAB vector length a constant operation?

I assume that MATLAB vectors/matrices carry some metadata about their dimensions, size, and length, so length(a) should be a very fast operation if a is a vector. Since the MATLAB documentation does not discuss complexity in general, is there any way to confirm this?
You are correct. "Under the hood" MATLAB stores and maintains a size for all array types, and the length function merely retrieves this value. It isn't quite a simple variable reference, because length has to look at all size dimensions and pick the largest, so it is O(n) in the number of dimensions, not in the number of elements.

Correct way to generate random numbers

On page 3 of "Lecture 8, White Noise and Power Spectral Density" it is mentioned that rand and randn create pseudo-random numbers. Please correct me if I am wrong: a sequence of random numbers is one for which two sequences are never exactly the same, even when generated from the same seed.
Pseudo-random numbers, on the other hand, are deterministic, i.e., two sequences are the same if generated from the same seed.
How can I create random numbers rather than pseudo-random numbers? I was under the impression that Matlab's rand and randn functions generate independent and identically distributed random numbers, but the slides say that they create pseudo-random numbers. Googling for how to create random numbers returns the rand and randn functions.
The reason for distinguishing random numbers from pseudo-random numbers is that I need to compare the performance of cryptography using (A) a random signal with white noise characteristics and (B) a pseudo-random signal with white noise characteristics. So (A) must be different from (B). I would be grateful for any code and for the correct way to generate random numbers and pseudo-random numbers.
Generation of "true" random numbers is a tricky exercise; you can check Wikipedia on RNG and the tests of randomness (http://en.wikipedia.org/wiki/Random_number_generation). This link offers RNG based on atmospheric noise (http://www.random.org/).
As mentioned above, it is really difficult (probably impossible) to create real random numbers with computer software alone. There are numerous projects on the internet that provide real random numbers generated by physical processes (for example the one Kostya mentioned). A particularly interesting one is this from HU Berlin.
That being said, for experiments like the one you want to perform, Matlab's pseudo RNGs are more than fine. Matlab's algorithms include the Mersenne Twister, which is one of the best-known pseudo RNGs (I would suggest you google the Mersenne Twister's properties). See Matlab rng documentation here.
Since you did not mention which type of system you want to simulate, one simple approach to solve your issue would be to use a good RNG (Mersenne Twister) for process A and a not-so-good one for process B.
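The seeded-versus-unseeded distinction can be sketched in Python rather than Matlab (a deliberate swap for illustration). Python's random module also uses the Mersenne Twister, and secrets.SystemRandom draws from the operating system's entropy source. Note the assumption: the OS source is typically a cryptographic generator fed by hardware noise, so it is non-reproducible but not necessarily "true" randomness in the strict physical sense.

```python
import random
import secrets

def seeded_sequence(seed: int, n: int = 5) -> list:
    """Pseudo-random: Mersenne Twister output is fully determined by the seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def os_entropy_sequence(n: int = 5) -> list:
    """Not reproducible: drawn from the operating system's entropy source."""
    rng = secrets.SystemRandom()
    return [rng.random() for _ in range(n)]

# Same seed gives the same sequence; there is no seed to replay the OS source.
print(seeded_sequence(42) == seeded_sequence(42))
print(os_entropy_sequence())
```

For numbers derived from a physical process, a service such as the random.org one linked above would replace os_entropy_sequence.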

When is it appropriate to use a simple modulus as a hashing function?

I need to create a 16 bit hash from a 32 bit number, and I'm trying to determine if a simple modulus 2^16 is appropriate.
The hash will be used in a 2^16 entry hash table for fast lookup of the 32 bit number.
My understanding is that if the data space has a fairly even distribution, a simple mod 2^16 is fine and shouldn't result in too many collisions.
In this case, my 32 bit number is the result of a modified adler32 checksum, using 2^16 as M.
So, in a general sense, is my understanding correct, that it's fine to use a simple mod n (where n is hashtable size) as a hashing function if I have an even data distribution?
And specifically, will adler32 give a random enough distribution for this?
Yes, if your 32-bit numbers are uniformly distributed over all possible values, then those numbers modulo n will also be uniformly distributed over the n possible values (exactly so when n divides 2^32, as 2^16 does, and very nearly so otherwise).
Whether the results of your modified checksum algorithm are uniformly distributed is an entirely different question. That will depend on whether the input you apply the algorithm to is long enough for the sums to roll over several times. If you are applying the algorithm to short strings that don't roll over the sums, then the result will not be uniformly distributed.
If you want a hash function, then you should use a hash function. Neither Adler-32 nor any CRC is a good hash function. There are many very fast and effective hash functions available in the public domain. You can look at CityHash.
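A small experiment makes the short-string problem concrete. It assumes the standard Adler-32 from Python's zlib (not the asker's modified variant, whose details are not given) and a hypothetical key set of all 3-letter lowercase strings. Taking mod 2^16 keeps only the low 16 bits, which for standard Adler-32 is the running byte sum.

```python
import itertools
import random
import string
import zlib

TABLE_SIZE = 1 << 16

def bucket(value: int) -> int:
    """The 'simple modulus' hash: keep the low 16 bits."""
    return value % TABLE_SIZE

# Case 1: uniformly distributed 32-bit keys spread over the whole table.
rng = random.Random(0)
uniform_buckets = {bucket(rng.getrandbits(32)) for _ in range(100_000)}

# Case 2: Adler-32 of every 3-letter lowercase string (17,576 keys).
short_keys = (
    "".join(t).encode()
    for t in itertools.product(string.ascii_lowercase, repeat=3)
)
adler_buckets = {bucket(zlib.adler32(k)) for k in short_keys}

print(len(uniform_buckets), len(adler_buckets))
```

The byte sum of three lowercase letters spans only 76 values, so all 17,576 keys land in at most 76 of the 65,536 buckets, while the uniform keys occupy most of the table.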