Hash function for an integer sequence

Say there is a list of permutations. Each permutation is a long list of integers. Let's consider a sample permutation and call it samplePerm. My task is to find out whether the list contains samplePerm. I think it would be a good idea to use a hash function. Since the permutations are very large (more than 10000 items), I believe the polynomial variant (as used for strings) is useless. Does anybody know the best practice?
UPDATE:
THE ORDER OF INTEGERS IN A PERMUTATION IS A KEY CRITERION! All permutations consist of the same numbers

One solution is to divide the integers into groups and treat each group as a string by concatenating its integers. You can then apply a string hash function (see Java's String.hashCode() for an algorithm) to each group, and finally add up the resulting numbers. That last step may produce collisions, so it is the place where a better idea is needed :)
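A minimal sketch of the grouping idea, written in Python for brevity. The names java_string_hash and grouped_hash, the group size, and the comma separator between integers are assumptions made for illustration and are not part of the answer; the per-group hash follows the 31-based polynomial of Java's String.hashCode(), kept as an unsigned 32-bit value.

```python
def java_string_hash(s: str) -> int:
    """Polynomial hash in the style of Java's String.hashCode() (unsigned 32-bit here)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def grouped_hash(perm, group_size=64):
    """Hash each group of integers as a string, then add the group hashes.

    group_size and the "," separator are illustrative assumptions.
    """
    total = 0
    for start in range(0, len(perm), group_size):
        group = perm[start:start + group_size]
        text = ",".join(map(str, group))  # separator keeps [1, 23] and [12, 3] apart
        total = (total + java_string_hash(text)) & 0xFFFFFFFF
    return total


original = [0, 1, 2, 3, 4, 5, 6, 7]
groups_swapped = [4, 5, 6, 7, 0, 1, 2, 3]   # whole groups exchanged
neighbours_swapped = [1, 0, 2, 3, 4, 5, 6, 7]  # order changed inside a group

print(grouped_hash(original, 4) == grouped_hash(groups_swapped, 4))      # True
print(grouped_hash(original, 4) == grouped_hash(neighbours_swapped, 4))  # False
```

The first comparison shows the weakness of plain addition: it ignores the order of the groups, so exchanging two whole groups produces the same value even though the order inside each group is respected.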

Related

Hash UUIDs without requiring ordering

I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary-length data inputs into fixed-width, homogeneous outputs. If the input is longer than the output, the only way to accomplish that is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back on the slower, complete comparison of the data. That's why many guides and languages insist that if you implement a hash function for a type, you must also implement equality, and vice versa.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding the pair to a cryptographic hash function, but you would be hard pressed to find one that guarantees no collisions, even with fixed-width inputs and a 256-bit output width.
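The smaller-first concatenation can be sketched as follows. The question is about Go; Python is used here only for brevity, and the function name pair_key is an assumption. The sketch relies on the standard uuid module, whose UUID objects expose their 128-bit value as the int attribute.

```python
import uuid


def pair_key(m: uuid.UUID, n: uuid.UUID) -> int:
    """Order-independent, collision-free 256-bit key for a pair of UUIDs."""
    lo, hi = sorted((m.int, n.int))  # smaller value goes first
    return (lo << 128) | hi          # concatenate the two 128-bit values


a = uuid.uuid4()
b = uuid.uuid4()
print(pair_key(a, b) == pair_key(b, a))  # True
print(pair_key(a, b) < 2**256)           # True
```

Because no bits are discarded, two different unordered pairs can never produce the same key, and the original pair can be read back from it, which is why this is a perfect but reversible mapping.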

Hash function for an array of integers

What would be a good hash function for an array of integers?
For example, I have two arrays [1,2,3] and [1,5]. What hash function should I adopt to separate these two arrays?
I thought of adding up each element after raising it to the power of 2 but this has a large cost associated with it due to multiple multiplications. Is there any simple hashing function for this scenario?
For that particular data set, just subtract one from the second-to-last item, that will give you a perfect minimal hash, with buckets 0 and 1 produced :-)
More seriously, the choice of a good hashing function does depend a great deal on the sort of data so that should be taken into consideration. It's hard to suggest something without knowing the properties of the data you'll be storing.
I would start simply by choosing an arbitrary function such as adding all the items in the array then adding the array length to that, and reducing it modulo some value:
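numbuckets = 97
# Start from the array length so arrays of different length are spread apart.
bucket = len(array) % numbuckets
# Add every item, reducing modulo the bucket count as we go.
for item in array:
    bucket = (bucket + item) % numbuckets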
Then examine the results (across a great many real data sets) to make sure there's not too many collisions. If there are, choose another function.
It's the same as with optimisation: measure, don't guess! Actually monitor the collisions and usage and act if it gets bad.
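To make "measure, don't guess" concrete, here is a small sketch that wraps the additive function above and counts collisions over a sample. The names bucket_of and collision_count and the sample arrays are assumptions for illustration; real measurements should use your own data.

```python
from collections import Counter

NUM_BUCKETS = 97


def bucket_of(array, num_buckets=NUM_BUCKETS):
    """Additive hash from the answer: length plus all items, modulo the bucket count."""
    bucket = len(array) % num_buckets
    for item in array:
        bucket = (bucket + item) % num_buckets
    return bucket


def collision_count(arrays, num_buckets=NUM_BUCKETS):
    """Number of arrays that landed in a bucket some earlier array already used."""
    counts = Counter(bucket_of(a, num_buckets) for a in arrays)
    return sum(c - 1 for c in counts.values())


sample = [[1, 2, 3], [1, 5], [3, 2, 1], [5, 1]]
print([bucket_of(a) for a in sample])  # [9, 8, 9, 8]
print(collision_count(sample))         # 2
```

The sample shows one property worth checking against real data: addition ignores order, so arrays that are permutations of each other always collide with this function.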

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time), looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is critically important to make a very quick decision for as large a percentage of that portion of the stream as possible. That is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions.
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a perfect hash is that a set of distinct inputs is mapped to integers in 1-1 fashion; a minimal perfect hash maps N inputs onto exactly N consecutive integers. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is N or a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A search of the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
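As a baseline for comparison, a binary search over sorted, non-overlapping ranges can be sketched with Python's standard bisect module. The table below reuses the values from the question's pseudo-code; the handler names are plain strings standing in for function pointers, and the name lookup is an assumption.

```python
import bisect

# (low, high, handler) entries, sorted by low and non-overlapping.
# Single values are stored as ranges of width one.
RANGES = [
    (2, 2, "func1"),
    (3, 3, "func2"),
    (8, 8, "func3"),
    (33, 122, "func4"),
    (10000, 10000, "func40"),
]
STARTS = [low for low, _, _ in RANGES]


def lookup(value):
    """Return the handler whose range contains value, or None if uninteresting."""
    i = bisect.bisect_right(STARTS, value) - 1  # last range starting at or below value
    if i >= 0:
        _, high, handler = RANGES[i]
        if value <= high:
            return handler
    return None


print(lookup(8))    # func3
print(lookup(100))  # func4
print(lookup(123))  # None
```

This is a plain balanced search, not the optimal (frequency-weighted) tree mentioned above; weighting the tree by how often each range is hit would be a further refinement.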
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.
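One way to read this proposal is a two-stage lookup: a hash table for the single values first, and a range search only for inputs the hash table rejects. The sketch below is an assumption about how that could look, using a Python dict as the hash table and the example values from the question; the names SINGLES, RANGES, and dispatch are hypothetical.

```python
import bisect

# Stage 1: single values in a hash table (a dict here).
SINGLES = {2: "func1", 3: "func2", 8: "func3", 10000: "func40"}

# Stage 2: the remaining ranges, sorted by lower bound and non-overlapping.
RANGES = [(33, 122, "func4")]
RANGE_STARTS = [low for low, _, _ in RANGES]


def dispatch(value):
    """Try the hash table first; fall back to the range search on a miss."""
    handler = SINGLES.get(value)
    if handler is not None:
        return handler
    i = bisect.bisect_right(RANGE_STARTS, value) - 1
    if i >= 0 and value <= RANGES[i][1]:
        return RANGES[i][2]
    return None


print(dispatch(3))    # func2
print(dispatch(50))   # func4
print(dispatch(500))  # None
```

Whether this beats a single binary search depends on how many inputs hit single values versus ranges, so it is worth measuring on a representative stream.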

How to calculate equal hash for similar strings?

I am building an anti-plagiarism tool. I use the shingle method. For example, I have the following shingles:
I go to the cinema
I go to the cinema1
I go to th cinema
Is there a method of calculating the equal hash for these lines?
I know about Levenshtein distance. However, I do not know which string I should take as the source for the comparison. Maybe there is a better way than computing the Levenshtein distance.
The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character that hash to different values.
Small proof:
Consider all possible strings.
Assume the hash function produces at least 2 different values across these strings.
Take any 2 strings A and B that hash to different values.
You can obviously go from A to B by just changing one character at a time.
Thus at some point the hash will change.
Thus at this point the hash will be different for a single character change.
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. Probably won't work too well since a single character omission will cause significant difference in the hash values.
Check a range of hashes. A hash is one dimensional, but string similarity is not, thus this probably won't work either.
All in all, hashing is probably not the way to go.
This question is a bit old, but you may be interested in this paper by two researchers at AT&T. They employ a technique reminiscent of the Nilsimsa hash to detect when similar SMS messages have been seen an "abnormal" number of times in a time window.
It sounds like locality-sensitive hashing would also be pertinent to your problem.
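To give a feel for the locality-sensitive idea, here is a minimal MinHash sketch, MinHash being one common locality-sensitive scheme (it is a swapped-in example, not the method of the paper mentioned above). It does not produce one equal hash for similar strings; instead it produces a short signature, and the fraction of matching positions estimates how similar two strings are. The function names, the use of character 3-grams, and the signature length of 64 are assumptions for illustration.

```python
import hashlib


def char_ngrams(text, k=3):
    """Set of overlapping character k-grams of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_hashes=64, k=3):
    """For each salted hash function, keep the smallest digest over all k-grams."""
    grams = char_ngrams(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


base = minhash_signature("I go to the cinema")
near = minhash_signature("I go to the cinema1")
far = minhash_signature("completely different text")
print(estimated_similarity(base, near))
print(estimated_similarity(base, far))
```

The first estimate should come out much higher than the second. A full locality-sensitive index would additionally group signatures into bands so that similar strings land in the same bucket without comparing every pair.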

Checking for string matches using hashes, without double-checking the entire string

I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?
I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)
To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.
Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.
For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash?
No, and there can't be. Think about it: the hash has a fixed, finite length, but the strings can be arbitrarily long. Say for argument's sake that the hash is 32 bits, which gives about 4.3 billion possible values. Can you create more than 4.3 billion distinct strings of the same length? Of course you can: as soon as the strings are longer than 32 bits, there are more distinct strings of that length than hash values, so at least two of them must share a hash. Comparing the hashes is therefore not enough to guarantee equality. This argument scales to longer hashes.
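A concrete instance with djb2, the hash mentioned in the question: the two-character strings "ab" and "bA" have the same length and the same djb2 value, because 33*97 + 98 = 33*98 + 65 = 3299. The sketch below uses the classic form of djb2 (start at 5381, multiply by 33, add the character) truncated to 32 bits; the helper is_match and its parameters are hypothetical names for the three-step check described in the question.

```python
def djb2(s: str) -> int:
    """Classic djb2: h = h * 33 + c, starting from 5381, kept to 32 bits."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def is_match(candidate, stored_hash, stored_len, stored_str):
    """Cheap checks first; the full comparison only runs when hash and length agree."""
    return (djb2(candidate) == stored_hash
            and len(candidate) == stored_len
            and candidate == stored_str)


stored = "ab"
print(djb2("ab") == djb2("bA"))                          # True: same length, same hash
print(is_match("bA", djb2(stored), len(stored), stored)) # False: full compare catches it
print(is_match("ab", djb2(stored), len(stored), stored)) # True
```

Without the final string comparison, "bA" would be accepted as a match for "ab".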
If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.
Some of the algorithms used for cyclic redundancy checks have guarantees like if there's exactly one bit different then the CRC is guaranteed to be different over a certain run length of bits, but that only works for relatively short runs.
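In practice you should be safe from collisions if you use a modern cryptographic hash function such as one of the Secure Hash Algorithm (SHA) variants, for example SHA-256. Collisions still exist in principle because the output has a fixed width, but the chance of hitting one by accident is negligible.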