Is there a simple way to create a unique integer key from a two-integer composite key?

Is there a simple way to create a unique integer key from a two-integer composite key? - tsql

For various reasons that aren't too germane to the question, I've got a table with a composite key made out of two integers and I want to create a single unique key out of those two numbers. My initial thought was to just concatenate them, but I ran into a problem quickly when I realized that a composite key of (51,1) would result in the same unique key as (5,11), namely, 511.
Does anyone have a clever way to generate an integer out of two integers such that the generated integer is unique to the pair of start integers?
Edit: After being confronted with an impressive amount of math, I'm realizing that one detail I should have included is the sizes of the keys in question. In the originating pair, the first key is currently 6 digits and will probably stay in 7 digits for the life of the system; the second key has yet to get larger than 20. Given these constraints, it looks like the problem is much less daunting.

You can mathematically prove this is impossible if you want the resulting key to comprise the same number of bits as its two components. However, if you start with two 32 bit ints, and can use a 64 bit int for the result, you could obviously do something like this:
key1 << 32 | key2
SQL Syntax
SELECT key1 * POWER(2, 32) + key2

This has been discussed in a fair amount of detail already (as recursive said, however, the output must be comprised of more bits than the individual inputs).
Mapping two integers to one, in a unique and deterministic way
How to use two numbers as a Map key
http://en.wikipedia.org/wiki/Cantor_pairing_function#Cantor_pairing_function

Multiply one with a high enough value
SELECT id1 * 1000000 + id2
Or use text concatenation:
SELECT CAST(CAST(id1 AS nvarchar(10)) + RIGHT('000000' + CAST(id2 AS nvarchar(10)), 6) AS int)
Or skip the integer thing and separate the IDs with something non-numeric:
SELECT CAST(id1 AS nvarchar) + ':' + CAST(id2 AS nvarchar)

You can only do it if you have an upper bound for one of the keys. Say you have key1 and key2, and up1 is a value that key1 will never reach, then you can combine the keys like this:
combined = key2 * up1 + key1;
Even if the keys could theoretically grow without limit, it's usually possible to estimate a save upper bound in practice.

As I like the theoretical side of your question (it really is beautiful), and to contradict what many of the practical answers say, I would like to give an answer to the "math" part of your tags :)
In fact it is possible to map any two numbers (or actually any series of numbers) to a single number. This is called the Gödel number and was first published in 1931 by Kurt Gödel.
To give a quick example, with your question; say we have two variables v1 and v2. Then v3=2v1*3v2 would give a unique number. This number also uniquely identifies v1 and v2.
Of course the resulting number v3 may grow undesirably rapid. Please, just take this answer as a reply to the theoretical aspect in your question.

Both of the suggested solutions require some knowledge about the range of accepted keys.
To avoid making this assumption, one can riffle the digits together.
Key1 = ABC => Digits = A, B, C
Key2 = 123 => Digits = 1, 2, 3
Riffle(Key1, Key2) = A, 1, B, 2, C, 3
Zero-padding can be used when there aren't enough digits:
Key1 = 12345, Key2 = 1 => 1020304051
This method also generalizes for any number of keys.

wrote these for mysql they work fine
CREATE FUNCTION pair (x BIGINT unsigned, y BIGINT unsigned)
RETURNS BIGINT unsigned DETERMINISTIC
RETURN ((x + y) * (x + y + 1)) / 2 + y;
CREATE FUNCTION reversePairX (z BIGINT unsigned)
RETURNS BIGINT unsigned DETERMINISTIC
RETURN (FLOOR((-1 + SQRT(1 + 8 * z))/2)) * ((FLOOR((-1 + SQRT(1 + 8 * z))/2)) + 3) / 2 - z;
CREATE FUNCTION reversePairY (z BIGINT unsigned)
RETURNS BIGINT unsigned DETERMINISTIC
RETURN z - (FLOOR((-1 + SQRT(1 + 8 * z))/2)) * ((FLOOR((-1 + SQRT(1 + 8 * z))/2)) + 1) / 2;

At the risk of sounding facetious:
NewKey = fn(OldKey1, OldKey2)
where fn() is a function that looks up a new autonumbered key value from a column added to your existing table.
Obviously, two integer fields can hold exponentially more values than a single integer field.

Why don't you just use ROW_NUMBER() or IDENTITY(int,1,1) to set new ID? Do they REALLY need to be in relation?

Related

How can I extract a specific bit from a 16-bit register using math ONLY?

I have a 16-bit WORD and I want to read the status of a specific bit or several bits.
I've tried a method that divides the word by the bit that I want, converts the result to two values - an integer and to a real, and compares the two. if they are not equal, then it it equates to false. This appears to only work if i am looking for a bit that the last 'TRUE' bit in the word. If there are any successive TRUE bits, it fails. Perhaps I just haven't done it right. I don't have the ability to use code, just basic math, boolean operations, and type conversion. Any ideas? I hope this isn't a dumb question but i have a feeling it is.
eg:
WORD 0010000100100100 = 9348
I want to know the value of bit 2. how can i determine it from 9348?

There are many ways, depending on what operations you can use. It appears you don't have much to choose from. But this should work, using just integer division and multiplication, and a test for equality.
(psuedocode):
x = 9348 (binary 0010000100100100, bit 0 = 0, bit 1 = 0, bit 2 = 1, ...)
x = x / 4 (now x is 1000010010010000
y = (x / 2) * 2 (y is 0000010010010000)
if (x == y) {
(bit 2 must have been 0)
} else {
(bit 2 must have been 1)
}
Every time you divide by 2, you move the bits to the left one position (in your big endian representation). Every time you multiply by 2, you move the bits to the right one position. Odd numbers will have 1 in the least significant position. Even numbers will have 0 in the least significant position. If you divide an odd number by 2 in integer math, and then multiply by 2, you loose the odd bit if there was one. So the idea above is to first move the bit you want to know about into the least significant position. Then, divide by 2 and then multiply by two. If the result is the same as what you had before, then there must have been a 0 in the bit you care about. If the result is not the same as what you had before, then there must have been a 1 in the bit you care about.
Having explained the idea, we can simplify to
((x / 8) * 2) <> (x / 4)
which will resolve to true if the bit was set, and false if the bit was not set.

AND the word with a mask [1].
In your example, you're interested in the second bit, so the mask (in binary) is
00000010. (Which is 2 in decimal.)
In binary, your word 9348 is 0010010010000100 [2]
0010010010000100 (your word)
AND 0000000000000010 (mask)
----------------
0000000000000000 (result of ANDing your word and the mask)
Because the value is equal to zero, the bit is not set. If it were different to zero, the bit was set.
This technique works for extracting one bit at a time. You can however use it repeatedly with different masks if you're interested in extracting multiple bits.
[1] For more information on masking techniques see http://en.wikipedia.org/wiki/Mask_(computing)
[2] See http://www.binaryhexconverter.com/decimal-to-binary-converter

The nth bit is equal to the word divided by 2^n mod 2
I think you'll have to test each bit, 0 through 15 inclusive.

You could try 9348 AND 4 (equivalent of 1<<2 - index of the bit you wanted)
9348 AND 4
should give 4 if bit is set, 0 if not.

So here is what I have come up with: 3 solutions. One is Hatchet's as proposed above, and his answer helped me immensely with actually understanding HOW this works, which is of utmost importance to me! The proposed AND masking solutions could have worked if my system supports bitwise operators, but it apparently does not.
Original technique:
( ( ( INT ( TAG / BIT ) ) / 2 ) - ( INT ( ( INT ( TAG / BIT ) ) / 2 ) ) <> 0 )
Explanation:
in the first part of the equation, integer division is performed on TAG/BIT, then REAL division by 2. In the second part, integer division is performed TAG/BIT, then integer division again by 2. The difference between these two results is compared to 0. If the difference is not 0, then the formula resolves to TRUE, which means the specified bit is also TRUE.
eg: 9348/4 = 2337 w/ integer division. Then 2337/2 = 1168.5 w/ REAL division but 1168 w/ integer division. 1168.5-1168 <> 0, so the result is TRUE.
My modified technique:
( INT ( TAG / BIT ) / 2 ) <> ( INT ( INT ( TAG / BIT ) / 2 ) )
Explanation:
effectively the same as above, but instead of subtracting the two results and comparing them to 0, I am just comparing the two results themselves. If they are not equal, the formula resolves to TRUE, which means the specified bit is also TRUE.
eg: 9348/4 = 2337 w/ integer division. Then 2337/2 = 1168.5 w/ REAL division but 1168 w/ integer division. 1168.5 <> 1168, so the result is TRUE.
Hatchet's technique as it applies to my system:
( INT ( TAG / BIT )) <> ( INT ( INT ( TAG / BIT ) / 2 ) * 2 )
Explanation:
in the first part of the equation, integer division is performed on TAG/BIT. In the second part, integer division is performed TAG/BIT, then integer division again by 2, then multiplication by 2. The two results are compared. If they are not equal, the formula resolves to TRUE, which means the specified bit is also TRUE.
eg: 9348/4 = 2337. Then 2337/2 = 1168 w/ integer division. Then 1168x2=2336. 2337 <> 2336 so the result is TRUE. As Hatchet stated, this method 'drops the odd bit'.
Note - 9348/4 = 2337 w/ both REAL and integer division, but it is important that these parts of the formula use integer division and not REAL division (12164/32 = 380 w/ integer division and 380.125 w/ REAL division)
I feel it important to note for any future readers that the BIT value in the equations above is not the bit number, but the actual value of the resulting decimal if the bit in the desired position was the only TRUE bit in the binary string (bit 2 = 4 (2^2), bit 6 = 64 (2^6))
This explanation may be a bit too verbatim for some, but may be perfect for others :)
Please feel free to comment/critique/correct me if necessary!

I just needed to resolve an integer status code to a bit state in order to interface with some hardware. Here's a method that works for me:
private bool resolveBitState(int value, int bitNumber)
{
return (value & (1 << (bitNumber - 1))) != 0;
}
I like it, because it's non-iterative, requires no cast operations and essentially translates directly to machine code operations like Shift, And and Comparison, which probably means it's really optimal.
To explain in a little more detail, I'm comparing the bitwise value to a mask for the bit I am interested in (value & mask) using an AND operation. If the bitwise AND operation result is zero, then the bit is not set (return false). If the AND operation result is not zero, then the bit is set (return true). The result of the AND operation is either zero or the value of the bit (1, 2, 4, 8, 16, 32...). Hence the boolean evaluation comparing the AND operation result and 0. The mask is created by taking the number 1 and shifting it left (bit wise), by the appropriate number of binary places (1 << n). The number of places is the number of the bit targeted minus 1. If it's bit #1, I want to shift the 1 left by 0 and if it's #2, I want to shift it left 1 place, etc.
I'm surprised no one rates my solution. It think it's most logical and succinct... and works.

Is there any difference between integer and bit(n) data types for a bitmask?

I am working with a table in a PostgreSQL database that has several boolean columns that determine some state (e.g. published, visible, etc.). I want to make a single status column that will store all these values as well as possible new ones in a form of a bitmask. Is there any difference between integer and bit(n) in this case?
This is going to be a rather big table, because it stores objects that users create via a web-interface. So I think I will have to use (partial) indexes for this column.

If you only have a few variables I would consider keeping separate boolean columns.
Indexing is easy. In particular also indexes on expressions and partial indexes.
Conditions for queries are easy to write and read and meaningful.
A boolean column occupies 1 byte (no alignment padding). For only a few variables this occupies the least space.
Unlike other options boolean columns allow NULL values for individual bits if you should need that. You can always define columns NOT NULL if you don't.
If you have more than a hand full variables but no more than 32, an integer column may serve best. (Or a bigint for up to 64 variables.)
Occupies 4 bytes on disk (may require alignment padding, depending on preceding columns).
Very fast indexing for exact matches ( = operator).
Handling individual values may be slower / less convenient than with varbit or boolean.
With even more variables, or if you want to manipulate the values a lot, or if you don't have huge tables or disk space / RAM is not an issue, or if you are not sure what to pick, I would consider bit(n) or bit varying(n) (short: varbit(n).
Occupies at least 5 bytes (or 8 for very long strings) plus 1 byte for each group of 8 bits (rounded up).
You can use bit string functions and operators directly, and some standard SQL functions as well.
For just 3 bits of information, individual boolean columns get by with 3 bytes, an integer needs 4 bytes (maybe additional alignment padding) and a bit string 6 bytes (5 + 1).
For 32 bits of information, an integer still needs 4 bytes (+ padding), a bit string occupies 9 bytes for the same (5 + 4) and boolean columns occupy 32 bytes.
To optimize disk space further you need to understand the storage mechanisms of PostgreSQL, especially data alignment. More in this related answer.
This answer on how to transform the types boolean, bit(n) and integer may be of help, too.

You can apply the bit string functions directly to a bit string without the need to cast from an integer.

With the advent of GENERATED columns in PostgreSQL (as from version 12), you could do something like this (all of the code below is available on the fiddle here):
Base table:
CREATE TABLE test
(
t_id INTEGER GENERATED BY DEFAULT AS IDENTITY,
data TEXT,
bitmask VARBIT(9)
);
but, with GENERATED columns, you can now do:
CREATE TABLE test
(
t_id INTEGER GENERATED BY DEFAULT AS IDENTITY,
data TEXT,
bitmask VARBIT(9), -- choose 9 because it's not 8, to show that you don't have to
-- select an INT or even a SMALLINT
published BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 0)::BOOLEAN) STORED,
visible BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 1)::BOOLEAN) STORED,
rubbish BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 2)::BOOLEAN) STORED,
masterpiece BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 3)::BOOLEAN) STORED,
meh BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 4)::BOOLEAN) STORED,
arts BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 5)::BOOLEAN) STORED,
legal BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 6)::BOOLEAN) STORED,
sport BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 7)::BOOLEAN) STORED,
politics BOOLEAN GENERATED ALWAYS AS (GET_BIT(bitmask, 8)::BOOLEAN) STORED,
CONSTRAINT subject_ck -- so you can't have conflicting subjects - just for demo purposes
CHECK -- a document can't be art and legal at the same time!
(
CASE
WHEN GET_BIT(bitmask, 5) = 1
THEN GET_BIT(bitmask, 6) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 8) = 0
WHEN GET_BIT(bitmask, 6) = 1
THEN GET_BIT(bitmask, 5) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 8) = 0
WHEN GET_BIT(bitmask, 7) = 1
THEN GET_BIT(bitmask, 6) = 0 AND GET_BIT(bitmask, 5) = 0 AND GET_BIT(bitmask, 8) = 0
WHEN GET_BIT(bitmask, 5) = 1
THEN GET_BIT(bitmask, 8) = 0 AND GET_BIT(bitmask, 7) = 0 AND GET_BIT(bitmask, 5) = 0
END
)
);
Now, the extra 9 booleans add about 12 bytes to the size of the table - if that isn't a problem, then we're good to go! Also, when (I presume shortly - as of writing 2022-09-09) PostgreSQL is enhanced with VIRTUAL columns, there'll be no space overhead at all.
The benefits of doing it like this is that it makes your SQL short and readable - instead of having to a bunch of ugly CASE statements, you'll simply be able to do the following:
INSERT INTO test (data, bitmask) VALUES
('Document 1', '000100000'),
('Document 2', '100000000'),
('Document 3', '101000001');
and also, stuff like this:
CREATE INDEX legal_ix ON test (legal) WHERE legal;
So now, obtaining all of the records is far easier on the eye - much more legible:
SELECT * FROM test;
Result:
t_id data bitmask published visible rubbish masterpiece meh arts legal sport politics
1 Document 1 000100000 f f f t f f f f f
2 Document 2 100000000 t f f f f f f f f
3 Document 3 101000001 t f t f f f f f t
You can also do:
BEGIN TRANSACTION; -- can't update the other way round or CHECK constraint will fail
-- CHECK constraints are not deferrable - can't be 8 & 6 simultaneously
UPDATE test
SET bitmask = SET_BIT(bitmask, 8, 0) WHERE data = 'Document 3';
UPDATE test
SET bitmask = SET_BIT(bitmask, 6, 1) WHERE data = 'Document 3';
COMMIT;
Result:
SELECT t_id, published, legal FROM test; -- legal has gone from f -> t
There are a few other bits and pieces in the fiddle.

how to create unique integer number from 3 different integers numbers(1 Oracle Long, 1 Date Field, 1 Short)

the thing is that, the 1st number is already ORACLE LONG,
second one a Date (SQL DATE, no timestamp info extra), the last one being a Short value in the range 1000-100'000.
how can I create sort of hash value that will be unique for each combination optimally?
string concatenation and converting to long later:
I don't want this, for example.
Day Month
12 1 --> 121
1 12 --> 121

When you have a few numeric values and need to have a single "unique" (that is, statistically improbable duplicate) value out of them you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are either well-chosen numbers (e.g. if you know 'a' is always in the 1-31 range, you can use P1=32) or, when you know nothing particular about the allowable ranges of a,b,c best approach is to have P1 and P2 as big prime numbers (they have the least chance to generate values that collide).
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
For example, Java implementation for .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
h = h * 31 + a[i];
Even though personally, I would have chosen a prime bigger than 31 as values inside a String can easily collide, since a delta of 31 places can be quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2122

Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012

In python, you can use this:
#pip install pairing
import pairing as pf
n = [12,6,20,19]
print(n)
key = pf.pair(pf.pair(n[0],n[1]),
pf.pair(n[2], n[3]))
print(key)
m = [pf.depair(pf.depair(key)[0]),
pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]

hash function providing unique uint from an integer coordinate pair

The problem in general:
I have a big 2d point space, sparsely populated with dots.
Think of it as a big white canvas sprinkled with black dots.
I have to iterate over and search through these dots a lot.
The Canvas (point space) can be huge, bordering on the limits
of int and its size is unknown before setting points in there.
That brought me to the idea of hashing:
Ideal:
I need a hash function taking a 2D point, returning a unique uint32.
So that no collisions can occur. You can assume that the number of
dots on the Canvas is easily countable by uint32.
IMPORTANT: It is impossible to know the size of the canvas beforehand
(it may even change),
so things like
canvaswidth * y + x
are sadly out of the question.
I also tried a very naive
abs(x) + abs(y)
but that produces too many collisions.
Compromise:
A hash function that provides keys with a very low probability of collision.

Cantor's enumeration of pairs
n = ((x + y)*(x + y + 1)/2) + y
might be interesting, as it's closest to your original canvaswidth * y + x but will work for any x or y. But for a real world int32 hash, rather than a mapping of pairs of integers to integers, you're probably better off with a bit manipulation such as Bob Jenkin's mix and calling that with x,y and a salt.

a hash function that is GUARANTEED collision-free is not a hash function :)
Instead of using a hash function, you could consider using binary space partition trees (BSPs) or XY-trees (closely related).
If you want to hash two uint32's into one uint32, do not use things like Y & 0xFFFF because that discards half of the bits. Do something like
(x * 0x1f1f1f1f) ^ y
(you need to transform one of the variables first to make sure the hash function is not commutative)

Like Emil, but handles 16-bit overflows in x in a way that produces fewer collisions, and takes fewer instructions to compute:
hash = ( y << 16 ) ^ x;

You can recursively divide your XY plane into cells, then divide these cells into sub-cells, etc.
Gustavo Niemeyer invented in 2008 his Geohash geocoding system.
Amazon's open source Geo Library computes the hash for any longitude-latitude coordinate. The resulting Geohash value is a 63 bit number. The probability of collision depends of the hash's resolution: if two objects are closer than the intrinsic resolution, the calculated hash will be identical.
Read more:
https://en.wikipedia.org/wiki/Geohash
https://aws.amazon.com/fr/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/
https://github.com/awslabs/dynamodb-geo

Your "ideal" is impossible.
You want a mapping (x, y) -> i where x, y, and i are all 32-bit quantities, which is guaranteed not to generate duplicate values of i.
Here's why: suppose there is a function hash() so that hash(x, y) gives different integer values. There are 2^32 (about 4 billion) values for x, and 2^32 values of y. So hash(x, y) has 2^64 (about 16 million trillion) possible results. But there are only 2^32 possible values in a 32-bit int, so the result of hash() won't fit in a 32-bit int.
See also http://en.wikipedia.org/wiki/Counting_argument
Generally, you should always design your data structures to deal with collisions. (Unless your hashes are very long (at least 128 bit), very good (use cryptographic hash functions), and you're feeling lucky).

Perhaps?
hash = ((y & 0xFFFF) << 16) | (x & 0xFFFF);
Works as long as x and y can be stored as 16 bit integers. No idea about how many collisions this causes for larger integers, though. One idea might be to still use this scheme but combine it with a compression scheme, such as taking the modulus of 2^16.

If you can do a = ((y & 0xffff) << 16) | (x & 0xffff) then you could afterward apply a reversible 32-bit mix to a, such as Thomas Wang's
uint32_t hash( uint32_t a)
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
That way you get a random-looking result rather than high bits from one dimension and low bits from the other.

You can do
a >= b ? a * a + a + b : a + b * b
taken from here.
That works for points in positive plane. If your coordinates can be in negative axis too, then you will have to do:
A = a >= 0 ? 2 * a : -2 * a - 1;
B = b >= 0 ? 2 * b : -2 * b - 1;
A >= B ? A * A + A + B : A + B * B;
But to restrict the output to uint you will have to keep an upper bound for your inputs. and if so, then it turns out that you know the bounds. In other words in programming its impractical to write a function without having an idea on the integer type your inputs and output can be and if so there definitely will be a lower bound and upper bound for every integer type.
public uint GetHashCode(whatever a, whatever b)
{
if (a > ushort.MaxValue || b > ushort.MaxValue ||
a < ushort.MinValue || b < ushort.MinValue)
{
throw new ArgumentOutOfRangeException();
}
return (uint)(a * short.MaxValue + b); //very good space/speed efficiency
//or whatever your function is.
}
If you want output to be strictly uint for unknown range of inputs, then there will be reasonable amount of collisions depending upon that range. What I would suggest is to have a function that can overflow but unchecked. Emil's solution is great, in C#:
return unchecked((uint)((a & 0xffff) << 16 | (b & 0xffff)));
See Mapping two integers to one, in a unique and deterministic way for a plethora of options..

According to your use case, it might be possible to use a Quadtree and replace points with the string of branch names. It is actually a sparse representation for points and will need a custom Quadtree structure that extends the canvas by adding branches when you add points off the canvas but it avoids collisions and you'll have benefits like quick nearest neighbor searches.

If you're already using languages or platforms that all objects (even primitive ones like integers) has built-in hash functions implemented (Java platform Languages like Java, .NET platform languages like C#. And others like Python, Ruby, etc ).
You may use built-in hashing values as a building block and add your "hashing flavor" in to the mix. Like:
// C# code snippet
public class SomeVerySimplePoint {
public int X;
public int Y;
public override int GetHashCode() {
return ( Y.GetHashCode() << 16 ) ^ X.GetHashCode();
}
}
And also having test cases like "predefined million point set" running against each possible hash generating algorithm comparison for different aspects like, computation time, memory required, key collision count, and edge cases (too big or too small values) may be handy.

the Fibonacci hash works very well for integer pairs
multiplier 0x9E3779B9
other word sizes 1/phi = (sqrt(5)-1)/2 * 2^w round to odd
a1 + a2*multiplier
this will give very different values for close together pairs
I do not know about the result with all pairs

Generate a hash sum for several integers

I am facing the problem of having several integers, and I have to generate one using them. For example.
Int 1: 14
Int 2: 4
Int 3: 8
Int 4: 4
Hash Sum: 43
I have some restriction in the values, the maximum value that and attribute can have is 30, the addition of all of them is always 30. And the attributes are always positive.
The key is that I want to generate the same hash sum for similar integers, for example if I have the integers, 14, 4, 10, 2 then I want to generate the same hash sum, in the case above 43. But of course if the integers are very different (4, 4, 2, 20) then I should have a different hash sum. Also it needs to be fast.
Ideally I would like that the output of the hash sum is between 0 and 512, and it should evenly distributed. With my restrictions I can have around 5K different possibilities, so what I would like to have is around 10 per bucket.
I am sure there are many algorithms that do this, but I could not find a way of googling this thing. Can anyone please post an algorithm to do this?.
Some more information
The whole thing with this is that those integers are attributes for a function. I want to store the values of the function in a table, but I do not have enough memory to store all the different options. That is why I want to generalize between similar attributes.
The reason why 10, 5, 15 are totally different from 5, 10, 15, it is because if you imagine this in 3d then both points are a totally different point
Some more information 2
Some answers try to solve the problem using hashing. But I do not think this is so complex. Thanks to one of the comments I have realized that this is a clustering algorithm problem. If we have only 3 attributes and we imagine the problem in 3d, what I just need is divide the space in blocks.
In fact this can be solved with rules of this type
if (att[0] < 5 && att[1] < 5 && att[2] < 5 && att[3] < 5)
Block = 21
if ( (5 < att[0] < 10) && (5 < att[1] < 10) && (5 < att[2] < 10) && (5 < att[3] < 10))
Block = 45
The problem is that I need a fast and a general way to generate those ifs I cannot write all the possibilities.

The simple solution:
Convert the integers to strings separated by commas, and hash the resulting string using a common hashing algorithm (md5, sha, etc).
If you really want to roll-your-own, I would do something like:
Generate large prime P
Generate random numbers 0 < a[i] < P (for each dimension you have)
To generate hash, calculate: sum(a[i] * x[i]) mod P

Given the inputs a, b, c, and d, each ranging in value from 0 to 30 (5 bits), the following will produce an number in the range of 0 to 255 (8 bits).
bucket = ((a & 0x18) << 3) | ((b & 0x18) << 1) | ((c & 0x18) >> 1) | ((d & 0x18) >> 3)
Whether the general approach is appropriate depends on how the question is interpreted. The 3 least significant bits are dropped, grouping 0-7 in the same set, 8-15 in the next, and so forth.
0-7,0-7,0-7,0-7 -> bucket 0
0-7,0-7,0-7,8-15 -> bucket 1
0-7,0-7,0-7,16-23 -> bucket 2
...
24-30,24-30,24-30,24-30 -> bucket 255
Trivially tested with:
for (int a = 0; a <= 30; a++)
for (int b = 0; b <= 30; b++)
for (int c = 0; c <= 30; c++)
for (int d = 0; d <= 30; d++) {
int bucket = ((a & 0x18) << 3) |
((b & 0x18) << 1) |
((c & 0x18) >> 1) |
((d & 0x18) >> 3);
printf("%d, %d, %d, %d -> %d\n",
a, b, c, d, bucket);
}

You want a hash function that depends on the order of inputs and where similar sets of numbers will generate the same hash? That is, you want 50 5 5 10 and 5 5 10 50 to generate different values, but you want 52 7 4 12 to generate the same hash as 50 5 5 10? A simple way to do something like this is:
long hash = 13;
for (int i = 0; i < array.length; i++) {
hash = hash * 37 + array[i] / 5;
}
This is imperfect, but should give you an idea of one way to implement what you want. It will treat the values 50 - 54 as the same value, but it will treat 49 and 50 as different values.
If you want the hash to be independent of the order of the inputs (so the hash of 5 10 20 and 20 10 5 are the same) then one way to do this is to sort the array of integers into ascending order before applying the hash. Another way would be to replace
hash = hash * 37 + array[i] / 5;
with
hash += array[i] / 5;
EDIT: Taking into account your comments in response to this answer, it sounds like my attempt above may serve your needs well enough. It won't be ideal, nor perfect. If you need high performance you have some research and experimentation to do.
To summarize, order is important, so 5 10 20 differs from 20 10 5. Also, you would ideally store each "vector" separately in your hash table, but to handle space limitations you want to store some groups of values in one table entry.
An ideal hash function would return a number evenly spread across the possible values based on your table size. Doing this right depends on the expected size of your table and on the number of and expected maximum value of the input vector values. If you can have negative values as "coordinate" values then this may affect how you compute your hash. If, given your range of input values and the hash function chosen, your maximum hash value is less than your hash table size, then you need to change the hash function to generate a larger hash value.

You might want to try using vectors to describe each number set as the hash value.
EDIT:
Since you're not describing why you want to not run the function itself, I'm guessing it's long running. Since you haven't described the breadth of the argument set.
If every value is expected then a full lookup table in a database might be faster.
If you're expecting repeated calls with the same arguments and little overall variation, then you could look at memoizing so only the first run for a argument set is expensive, and each additional request is fast, with less memory usage.

You would need to define what you mean by "similar". Hashes are generally designed to create unique results from unique input.
One approach would be to normalize your input and then generate a hash from the results.

Generating the same hash sum is called a collision, and is a bad thing for a hash to have. It makes it less useful.
If you want similar values to give the same output, you can divide the input by however close you want them to count. If the order makes a difference, use a different divisor for each number. The following function does what you describe:
int SqueezedSum( int a, int b, int c, int d )
{
return (a/11) + (b/7) + (c/5) + (d/3);
}
This is not a hash, but does what you describe.

You want to look into geometric hashing. In "standard" hashing you want
a short key
inverse resistance
collision resistance
With geometric hashing you susbtitute number 3 with something whihch is almost opposite; namely close initial values give close hash values.

Another way to view my problem is using the multidimesional scaling (MS). In MS we start with a matrix of items and what we want is assign a location of each item to an N dimensional space. Reducing in this way the number of dimensions.
http://en.wikipedia.org/wiki/Multidimensional_scaling

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Is there a simple way to create a unique integer key from a two-integer composite key? - tsql

At the risk of sounding facetious: NewKey = fn(OldKey1, OldKey2) where fn() is a function that looks up a new autonumbered key value from a column added to your existing table. Obviously, two integer fields can hold exponentially more values than a single integer field.

Why don't you just use ROW_NUMBER() or IDENTITY(int,1,1) to set new ID? Do they REALLY need to be in relation?

Related

How can I extract a specific bit from a 16-bit register using math ONLY?

Is there any difference between integer and bit(n) data types for a bitmask?

how to create unique integer number from 3 different integers numbers(1 Oracle Long, 1 Date Field, 1 Short)

hash function providing unique uint from an integer coordinate pair

Generate a hash sum for several integers

Categories

Resources