I have 3 non-negative integers and a number n such that
0 <= a <= n, 0 <= b <= n, and 0 <= c <= n.
I need a one-way hash function that maps these 3 integers to one integer (could be any integer, positive or negative). Is there a way to do this, and if so, how? Is there a way so that this function can be expressed as a simple mathematical expression whose only parameters are a, b, c, and n?
Note: I need this function because I was using tuples of 3 integers as keys in a dictionary in Python, and with upwards of 10^10 keys, space is a real issue.
How about the Cantor pairing function (https://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function)?
Let
H(a,b) := .5*(a + b)*(a + b + 1) + b
then
H(a,b,c) := .5*(H(a,b) + c)*(H(a,b) + c + 1) + c
You mentioned that you need a one-way hash, but based on your detailed description about memory constraints it seems that an invertible hash would also suffice.
This doesn't use the assumption that a, b, and c are bounded above and below.
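Since the question mentions Python dictionaries, here is a minimal Python sketch of the two formulas above (the function names are just illustrative); integer division keeps the result an exact int:

def H2(a, b):
    # H(a,b) = 0.5*(a + b)*(a + b + 1) + b, in exact integer arithmetic
    return (a + b) * (a + b + 1) // 2 + b

def H3(a, b, c):
    # H(a,b,c) pairs H(a,b) with c, exactly as in the formula above
    return H2(H2(a, b), c)

print(H3(1, 2, 3))  # 69 -- a unique non-negative integer for each triple (a, b, c)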
Augmenting the answer above for a more concise implementation:
int cantor(int a, int b) {
    // Cantor pairing of two non-negative integers
    return (a + b + 1) * (a + b) / 2 + b;
}

int hash(int a, int b, int c) {
    // pair b and c first, then pair a with the result
    // note: for larger inputs the result exceeds the range of int, so use a 64-bit type (e.g. long long)
    return cantor(a, cantor(b, c));
}
The easiest way to understand this is that the Cantor pairing function assigns a natural number to every pair of natural numbers.
Once we've assigned a natural number N = cantor(b, c), then we can assign a new unique natural number M = cantor(a, N), which we can use as a hash code and is a unique natural number for every triple a, b, c.
As a more general case, you could hash more integers by just applying cantor once more with the next integer (e.g. cantor(a, cantor(b, cantor(c, d)))).
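Since the original question is about Python dictionary keys, the same nesting in Python might look like this (a minimal sketch; hash4 just illustrates the "one more cantor per extra integer" idea):

def cantor(a, b):
    # pairs two non-negative integers into one unique non-negative integer
    return (a + b + 1) * (a + b) // 2 + b

def hash3(a, b, c):
    return cantor(a, cantor(b, c))

def hash4(a, b, c, d):
    # one more level of nesting for a fourth integer
    return cantor(a, cantor(b, cantor(c, d)))

table = {}
table[hash3(10, 20, 30)] = "value previously keyed by the tuple (10, 20, 30)"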
I am currently reading about the Rabin–Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val * p^n ) mod m
Where:
char_i_val: the integer value of the character plus 1, given by string[i] - 'a' + 1
p is a prime number larger than the character set
m is a large prime number
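Written out directly (ignoring overflow), the formula above is just this sum; a literal Python transcription for reference:

def poly_hash(s, p=31, m=10**9 + 9):
    # hash = (char_0_val * p^0 + char_1_val * p^1 + ...) mod m
    return sum((ord(c) - ord('a') + 1) * p**i for i, c in enumerate(s)) % m

print(poly_hash("abc"))  # 1*1 + 2*31 + 3*31^2 = 2946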
The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:
long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}
I understand what the program is trying to do but I do not understand why it is correct.
My question
I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
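(These two rules are easy to spot-check numerically, for example in Python:)

import random
m = 10**9 + 9
a, b = random.randrange(10**12), random.randrange(10**12)
assert (a + b) % m == (a % m + b % m) % m
assert (a * b) % m == ((a % m) * (b % m)) % m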
Based on the above shouldn't the code be as follows?
long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        int char_value = (c - 'a' + 1);
        hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
        p_pow = (p_pow%m * p%m) % m;
    }
    return hash_value;
}
What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.
Mathematically, there is no reason to reduce intermediate results modulo m.
Operationally, there are a couple of very closely related reasons to do it:
To keep numbers small enough that they can be represented efficiently.
To keep numbers small enough that operations on them do not overflow.
So let's look at some quantities and see if they need to be reduced.
p was defined as some value less than m, so p % m == p.
p_pow and hash_value have already been reduced modulo m when they were computed, so reducing them modulo m again would do nothing.
char_value is at most 26, which is already less than m.
char_value * p_pow is at most 26 * (m - 1). That can be, and often will be, more than m. So reducing it modulo m would do something. But it can still be delayed, because the next step is still "safe" (no overflow).
char_value * p_pow + hash_value is still at most 27 * (m - 1), which is still much less than 2^63 - 1 (the maximum value for a long long; see below for why I assume that a long long is 64-bit), so there is no problem yet. It's fine to reduce modulo m after the addition.
As a bonus, the loop could actually do (2^63 - 1) / (27 * (m - 1)) iterations before it needs to reduce hash_value modulo m. That's over 341 million iterations! For most practical purposes you could therefore remove the first % m and return hash_value % m; instead.
I used 2^63 - 1 in this calculation because p_pow = (p_pow * p) % m requires long long to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it were a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because p_pow can be approximately a billion and a 32-bit type cannot hold 31 billion.
BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for char_value which is bad news because the remainder operator % in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.
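To illustrate the point that, mathematically, the intermediate reductions are optional, here is a small Python sketch (Python integers never overflow, so this checks only the arithmetic, not the C++ overflow behaviour): both variants produce the same value.

def hash_reduced(s, p=31, m=10**9 + 9):
    # reduce modulo m at every step, like the cp-algorithms code
    h, p_pow = 0, 1
    for c in s:
        h = (h + (ord(c) - ord('a') + 1) * p_pow) % m
        p_pow = (p_pow * p) % m
    return h

def hash_deferred(s, p=31, m=10**9 + 9):
    # no intermediate reductions at all; reduce once at the very end
    return sum((ord(c) - ord('a') + 1) * p**i for i, c in enumerate(s)) % m

assert hash_reduced("somelowercasestring") == hash_deferred("somelowercasestring")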
As I am new to MATLAB and Mathematica, I am trying to solve two (easy) problems using one of these two programs.
"In number theory, Lagrange’s four-square theorem, states that every natural number n can be written as n= a^2+ b^2 + c^2 + d^2, where a, b, c, d are integers.
Given a natural number n, display all possible integers a, b, c, d.
The number of ways to write a natural number n as the sum of four squares is denoted by r4(n). Using Jacobi's theorem, plot the function r4(n) and compare it with the function 8n√(log n)."
This is a partial answer using Mathematica built-in functions.
PowerRepresentations[n,k,p] gives the distinct representations of the integer n as a sum of k non-negative p-th integer powers.
Attention: by distinct we mean that if n = n1^p + n2^p + n3^p + ..., the function returns k-tuples such that n1 <= n2 <= n3 <= ...
Example:
PowerRepresentations[20,4,2]
gives
{{0,0,2,4},{1,1,3,3}}
To get the number of possible representations of the integer n as a sum of d squares, you can use the SquaresR[d,n] function (your r4(n) function, generalized to d squares).
Example:
SquaresR[4,20]
prints
144
However, as you explained, there is still some work to do, because SquaresR also counts negative solutions and permuted ones.
For instance:
SquaresR[2,20]
returns
8
This 8 counts all sign changes and permutations:
4 sign changes:
{2,4},{2,-4},{-2,4},{-2,-4}
times
2 permutations
{2,4},{4,2}
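If it helps to cross-check the Mathematica output, here is a brute-force Python sketch (not Mathematica) that reproduces both counts for n = 20:

from itertools import product
from math import isqrt

def all_reps(n):
    # every (a, b, c, d) with a^2 + b^2 + c^2 + d^2 = n, signs and order included
    # (this is what SquaresR[4, n] counts)
    r = isqrt(n)
    return [t for t in product(range(-r, r + 1), repeat=4) if sum(x * x for x in t) == n]

def distinct_reps(n):
    # only 0 <= a <= b <= c <= d, like PowerRepresentations[n, 4, 2]
    r = isqrt(n)
    return [(a, b, c, d)
            for a in range(r + 1) for b in range(a, r + 1)
            for c in range(b, r + 1) for d in range(c, r + 1)
            if a*a + b*b + c*c + d*d == n]

print(len(all_reps(20)))   # 144, matching SquaresR[4, 20]
print(distinct_reps(20))   # [(0, 0, 2, 4), (1, 1, 3, 3)]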
Given an integer m, a hash function defined on T is a map T -> {0, 1, 2, ..., m - 1}. If k is an element of T and m is a positive integer, we denote hash(k, m) its hashed value.
For simplicity, most hash functions are of the form hash(k, m) = f(k) % m where f is a map from T to the set of integers.
In the case where m = 2^p (which is often used so that the modulo m operation is cheap) and T is a set of integers, I have seen many people using f(k) = c * k with c being a prime number.
I understand that if you want to choose a function of the form f(k) = c * k, you need gcd(c, m) = 1 for every hash table size m. Even though using a prime number fits the bill, c = 1 is also good.
So my question is the following: why do people still use f(k) = prime * k as their hash function? What kind of nice property does it have?
You don't need it to be prime. One of the most efficient hash functions with provable collision resistance just multiplies with a random number: https://en.wikipedia.org/wiki/Universal_hashing#Avoiding_modular_arithmetic. You do however need it to be odd.
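For reference, the multiply-shift scheme from that Wikipedia section looks roughly like this in Python (w is the machine word size, 2^p is the table size, and the multiplier a must be a random odd w-bit number):

import random

w = 64                              # word size in bits
p = 10                              # the table has m = 2**p buckets
a = random.randrange(1, 2**w, 2)    # random odd multiplier

def multiply_shift(k):
    # (a * k mod 2**w) >> (w - p): keep the top p bits of the truncated product
    return ((a * k) & (2**w - 1)) >> (w - p)

print(multiply_shift(123456789))    # a bucket index in [0, 2**p)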
Given 2 integers a and b (positive or negative), is there any formula/method for generating a unique ID number?
Note: 1. The results of f(a,b) and f(b,a) should be different. 2. Calculating f(a,b) multiple times should always give the same result.
To clarify the question: the function f(n) = (n * p) % q (where n = input sequence value, p = step size, q = maximum result size, n a non-negative integer, n < q, p < q, p ⊥ q (coprime)) gives a unique ID number.
But in my requirement the inputs are two numbers, and a and b can be negative or positive integers.
Any reference is appreciated.
You could generate a long (64 bit) from 2 integers (32 bit) by shifting the first integer left by 32 bits and then putting the second integer in the lower 32 bits.
private long uniqueId(int left, int right) {
    long uniqueId = (long) left;
    uniqueId = uniqueId << 32;            // left occupies the upper 32 bits
    uniqueId += (right & 0xFFFFFFFFL);    // mask avoids sign extension, so right occupies the lower 32 bits
    return uniqueId;
}
Say your integers have a range in [MIN_INT,MAX_INT]. Then, given an integer n from this range, the function
f(n) = n - MIN_INT
assigns a unique non-negative integer f(n) in the range [0, MAX_INT - MIN_INT], which is often called a rank.
Denote M = MAX_INT - MIN_INT + 1. Then, to find a unique id g(n,m) of two concatenated integers n and m, you can use the common access style also used for two-dimensional arrays:
g(n,m) = f(n)*M + f(m)
That is, you simply offset the second integer by the largest possible value and count on.
Practically, of course, you have to be careful in order to avoid overflows -- that is, you should use some suited data types.
Here is an example: say your integers come from the range [-1,4], thus M=6. Then, for the two integers n=3 and m=-1 from this range, f(n) = 3 - (-1) = 4 and f(m) = 0, so g(n,m) = 4*6 + 0 = 24 can be used as the id.
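A minimal Python sketch of the same idea, assuming 32-bit inputs (Python integers do not overflow, so no special data type is needed):

MIN_INT, MAX_INT = -2**31, 2**31 - 1
M = MAX_INT - MIN_INT + 1            # 2**32 possible values per integer

def f(n):
    # rank of n within [MIN_INT, MAX_INT], i.e. a value in [0, M - 1]
    return n - MIN_INT

def g(n, m):
    # unique id: offset the first rank by M, then add the second rank
    return f(n) * M + f(m)

print(g(3, -1))    # a unique non-negative id; note g(-1, 3) gives a different one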
I have encountered a surprisingly challenging problem: arranging a matrix-like structure (a list of lists) of values subject to the following constraints (or deciding it is not possible):
Given a matrix of m randomly generated rows, each with up to n distinct values (no repeats within a row), arrange the matrix such that the following holds (if possible):
1) The matrix must be "lower triangular"; the rows must be ordered by ascending length so the only "gaps" are in the top right corner.
2) If a value appears in more than one row it must be in the same column (i.e. rearranging the order of values in a row is allowed).
Expression of the problem/solution in a functional language (e.g. Scala) is desirable.
Example 1 - which has a solution
A B
C E D
C A B
becomes (as one solution)
A B
E D C
A B C
since A, B and C all appear in columns 1, 2 and 3, respectively.
Example 2 - which has no solution
A B C
A B D
B C D
has no solution since the constraints require the third row to have the C and D in the third
column which is not possible.
I thought this was an interesting problem and have modeled a proof-of-concept version in MiniZinc (a very high-level Constraint Programming system) which seems to be correct. I'm not sure if it's of any use, and to be honest I'm not sure how well it scales to the very largest problem instances.
The first problem instance has - according to this model - 4 solutions:
B A _
E D C
B A C
----------
B A _
D E C
B A C
----------
A B _
E D C
A B C
----------
A B _
D E C
A B C
The second example is considered unsatisfiable (as it should).
The complete model is here: http://www.hakank.org/minizinc/ordering_a_list_of_lists.mzn
The basic approach is to use matrices, where shorter rows are filled with a null value (here 0, zero). The problem instance is the matrix "matrix"; the resulting solution is in the matrix "x" (the decision variables, as integers which are then translated to strings in the output). Then there is a helper matrix, "perms", which is used to ensure that each row in "x" is a permutation of the corresponding row in "matrix", done with the predicate "permutation3". There are some other helper arrays/sets which simplify the constraints.
The main MiniZinc model (sans output) is shown below.
Here are some comments/assumptions which might make the model useless:
this is just a proof-of-concept model since I thought it was an interesting problem.
I assume that the rows in the matrix (the problem data) are already ordered by size (lower triangular). This should be easy to do as a preprocessing step where Constraint Programming is not needed.
the shorter lists are filled with 0 (zero) so we can work with matrices.
since MiniZinc is a strongly typed language and doesn't support symbols, we just define integers 1..5 to represent the letters A..E. Working with integers is also beneficial when using traditional Constraint Programming systems.
% The MiniZinc model (sans output)
include "globals.mzn";
int: rows = 3;
int: cols = 3;
int: A = 1;
int: B = 2;
int: C = 3;
int: D = 4;
int: E = 5;
int: max_int = E;
array[0..max_int] of string: str = array1d(0..max_int, ["_", "A","B","C","D","E"]);
% problem A (satisfiable)
array[1..rows, 1..cols] of int: matrix =
array2d(1..rows, 1..cols,
[
A,B,0, % fill this shorter array with "0"
E,D,C,
A,B,C,
]);
% the valid values (we skip 0, zero)
set of int: values = {A,B,C,D,E};
% identify which rows a specific value appears in.
% E.g. for problem A:
% value_rows: [{1, 3}, {1, 3}, 2..3, 2..2, 2..2]
array[1..max_int] of set of int: value_rows =
[ {i | i in 1..rows, j in 1..cols where matrix[i,j] = v} | v in values];
% decision variables
% The resulting matrix
array[1..rows, 1..cols] of var 0..max_int: x;
% the permutations from matrix to x
array[1..rows, 1..cols] of var 0..max_int: perms;
%
% permutation3(a,p,b)
%
% get the permutation from a to b using the permutation p.
%
predicate permutation3(array[int] of var int: a,
array[int] of var int: p,
array[int] of var int: b) =
forall(i in index_set(a)) (
b[i] = a[p[i]]
)
;
solve satisfy;
constraint
forall(i in 1..rows) (
% ensure unicity of the values in the rows in x and perms (except for 0)
alldifferent_except_0([x[i,j] | j in 1..cols]) /\
alldifferent_except_0([perms[i,j] | j in 1..cols]) /\
permutation3([matrix[i,j] | j in 1..cols], [perms[i,j] | j in 1..cols], [x[i,j] | j in 1..cols])
)
/\ % zeros in x are where the zeros are in matrix
forall(i in 1..rows, j in 1..cols) (
if matrix[i,j] = 0 then
x[i,j] = 0
else
true
endif
)
/\ % ensure that same values are in the same column:
% - for each of the values
% - ensure that it is positioned in one column c
forall(k in 1..max_int where k in values) (
exists(j in 1..cols) (
forall(i in value_rows[k]) (
x[i,j] = k
)
)
)
;
% the output
% ...
I needed a solution in a functional language (XQuery) so I implemented this first in Scala due to its expressiveness and I post the code below. It uses a brute-force, breadth first style search for solutions. I'm only interested in a single solution (if one exists) so the algorithm throws away the extra solutions.
def order[T](listOfLists: List[List[T]]): List[List[T]] = {

  def isConsistent(list: List[T], listOfLists: List[List[T]]) = {
    def isSafe(list1: List[T], list2: List[T]) =
      (for (i <- list1.indices; j <- list2.indices) yield
        if (list1(i) == list2(j)) i == j else true
      ).forall(_ == true)
    (for (row <- listOfLists) yield isSafe(list, row)).forall(_ == true)
  }

  def solve(fixed: List[List[T]], remaining: List[List[T]]): List[List[T]] =
    if (remaining.isEmpty)
      fixed // Solution found so return it
    else
      (for {
        permutation <- remaining.head.permutations.toList
        if isConsistent(permutation, fixed)
        ordered = solve(permutation :: fixed, remaining.tail)
        if !ordered.isEmpty
      } yield ordered) match {
        case solution1 :: otherSolutions => // There are one or more solutions so just return one
          solution1
        case Nil => // There are no solutions
          Nil
      }

  // Ensure each list has unique items (i.e. no dups within the list)
  require (listOfLists.forall(list => list == list.distinct))

  /*
   * The only optimisations applied to an otherwise full walk through all solutions is to sort the list of list so that the lengths
   * of the lists are increasing in length and then starting the ordering with the first row fixed i.e. there is one degree of freedom
   * in selecting the first row; by having the shortest row first and fixing it we both guarantee that we aren't disabling a solution from being
   * found (i.e. by violating the "lower triangular" requirement) and can also avoid searching through the permutations of the first row since
   * these would just result in additional (essentially duplicate except for ordering differences) solutions.
   */
  //solve(Nil, listOfLists).reverse // This is the unoptimised version
  val sorted = listOfLists.sortWith((a, b) => a.length < b.length)
  solve(List(sorted.head), sorted.tail).reverse
}