How does Postgres calculate multi-column hashes?

In PG I have a table which is partitioned by hash (*text*, *bigint*). Looking at a previous answer, the functions used for hashing the individual columns can be seen, but I'm unsure which function is used to build the combined hash.
Is it treating the partition keys as a record and using hash_record?
Ultimately I want to know this to rebuild the hashing function in Java to optimise reads and writes to specific partitions.

The hash_combine64 function is used to calculate the final hash value. According to the comments in the code, it's based on Boost's hash_combine approach. You can also find the whole partition calculation algorithm in the compute_partition_hash_value function.
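For the Java side, here is a minimal sketch of that combine step, transcribed from the Postgres sources (the constants are copied from hash_combine64 and HASH_PARTITION_SEED; verify them against the server version you run). The per-column 64-bit values come from the type-specific extended hash functions (hashtextextended for text, hashint8extended for bigint), each called with HASH_PARTITION_SEED, so those still have to be ported to reproduce the routing exactly; the class and method names below are just illustrative.

```java
// Sketch of PostgreSQL's combine step, transcribed to Java.
// Constants are copied from the Postgres sources; double-check them for your version.
public final class PgPartitionHash {
    // Seed passed to every per-column extended hash function (HASH_PARTITION_SEED).
    static final long HASH_PARTITION_SEED = 0x7A5B22367996DCFDL;

    // hash_combine64(): based on boost::hash_combine, widened to 64 bits.
    // The C code works on uint64, so the right shift must be unsigned (>>>).
    static long hashCombine64(long a, long b) {
        a ^= b + 0x49a0f4dd15e5a8e3L + (a << 54) + (a >>> 7);
        return a;
    }

    // Mirrors the loop in compute_partition_hash_value(): NULL columns are skipped,
    // non-NULL columns contribute their extended hash value in partition-key order.
    static long rowHash(long[] columnHashes, boolean[] isNull) {
        long rowHash = 0L;
        for (int i = 0; i < columnHashes.length; i++) {
            if (!isNull[i]) {
                rowHash = hashCombine64(rowHash, columnHashes[i]);
            }
        }
        return rowHash;
    }

    // A row lands in the partition whose (modulus, remainder) bound satisfies this;
    // the modulo is unsigned because the C code operates on uint64.
    static boolean satisfiesHashPartition(long rowHash, int modulus, int remainder) {
        return Long.remainderUnsigned(rowHash, modulus) == remainder;
    }
}
```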

Related

What is wrong with my extendible hashing solution?

I have the solution to extendible hashing below. I was wondering whether hashing it this way is a correct representation. I know that your directories can also use a binary representation, and that you rehash and increase your global depth every time a clash happens. But is this also a correct representation?
There is an extendible hash table with leaf size M = 4 entries, and with the directory initially indexed using the two most significant bits. Consider insertion, into an initially empty table, of the keys that hash into the following values: 0100010, 0100100, 1000000, 0110101, 0101111, 1000001, 0100000, 1001000, 1001001, 1000010.
A. Show the state of the table after the first 8 insertions.
B. Show the state of the table after the remaining two insertions.

Hash Table Confusion - How much space is needed for Hash Table with a good (eg. Cryptographic) Hash Function?

I am learning about Hash Tables, Hash Maps etc. I have just implemented a Hash Table in C, with operations: insert(HTable, key), delete(HTable, key), initialize(HTable) and search(HTable, key).
I would like to ask something. Since in a (proper) Hash Table the computed hash indexes could be very large, doesn't this mean that the space consumed will be something like INT_MAX (which is still O(n), of course), or more? I mean, given the input element that we want to store in a hash table (i.e. insert into it), the insert() function would call the hash function, which would then compute the hashed index for the element to go in. Thus it would use the hash function to find this index.
When we use the hash function to operate on the element, the hashed index could become very large. With a proper, for example cryptographic, hash function, this index could become huge (they use prime numbers with 300 digits, as in Diffie-Hellman public key cryptography, etc.), right? I know that in normal hash functions (such as the trivial ones beginners use to learn) we apply a mod operation so that the element fits within the hash table's bounds, but in doing so, don't we perhaps limit the hash function's potential?
So to uniquely map an element to the hash table we must use a HUGE Hash Table. How are these cryptographic hash tables implemented? They must be completely secure, right? Even the Stack Overflow tag on "cryptographichashfunction" says that it is extremely unlikely to find two inputs that map to the same element (so the possibility of collisions is tiny). Wouldn't this, though, require a HUGE array to be stored in memory (or on disk)? Therefore, the memory consumption would be huge.
Of course, the time complexity is not a problem. We just take the start address of the hash table/array, add the index to it, and go to that place in memory to get the value (O(1), the search principle of a Hash Table).
Am I wrong somewhere? Is there something I'm missing? I hope I made myself clear. So to conclude, I would like confirmation on this: does a good hash function require a huge array (Hash Table) and as such a very large amount of memory to be properly implemented? Is so much space justified, or is there something I don't quite get? Thanks.
In general, cryptographic hash values are not used for hash tables. Instead, a fast hash is used, and of that hash value only as many bits are used as are needed for the size of the table. If multiple keys map to the same index, the values are stored in a separate structure, possibly with additional information to choose between them.
It is not required that the hash output be unique; the hash function output would be too large and the table required would certainly not fit in memory. Besides that, cryptographic hashes are generally quite slow.
Cryptographic hash functions are usually built from operations also used in symmetric block ciphers: mixing and bitwise operators applied over a large number of rounds. Modular arithmetic, as used for e.g. RSA, is generally not used.
All in all, the main thing is that the generated index doesn't need to be unique. Usually, if one hash index leads to multiple values, they are stored in a list or set where the key can be compared by value.
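As a concrete illustration of the above, here is a minimal, non-production Java sketch (all names invented) in which a fast hash is reduced to a small table size by masking and colliding keys are chained in per-bucket lists:

```java
import java.util.LinkedList;
import java.util.List;

// Minimal sketch of the idea above: a fast, non-cryptographic hash is reduced
// to the table size with a mask, and colliding keys share a bucket.
public class SimpleChainedTable<K, V> {
    private static final int TABLE_SIZE = 1 << 10;   // 1024 buckets, far below INT_MAX

    private static final class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final List<Entry<K, V>>[] buckets = new List[TABLE_SIZE];

    // Only the low bits of the (fast) hash are used to pick a bucket.
    private int bucketFor(K key) {
        return key.hashCode() & (TABLE_SIZE - 1);
    }

    public void insert(K key, V value) {
        int i = bucketFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }   // key already present
        }
        buckets[i].add(new Entry<>(key, value));                  // collision: chain it
    }

    public V search(K key) {
        List<Entry<K, V>> bucket = buckets[bucketFor(key)];
        if (bucket == null) return null;
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) return e.value;                // compare keys by value
        }
        return null;
    }
}
```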

HBase row key design for monotonically increasing keys

I have an HBase table where I'm writing row keys like:
<prefix>~1
<prefix>~2
<prefix>~3
...
<prefix>~9
<prefix>~10
A scan in the HBase shell gives the output:
<prefix>~1
<prefix>~10
<prefix>~2
<prefix>~3
...
<prefix>~9
How should a row key be designed so that the row with key <prefix>~10 comes last? I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
How should a row key be designed so that the row with key <prefix>~10 comes last?
You see the scan output in this way because rowkeys in HBase are kept sorted lexicographically, irrespective of the insertion order. This means that they are compared by their string (byte) representations: remember that rowkeys in HBase are treated as arrays of bytes. The lowest-ordered rowkey appears first in a table. That's why 10 appears before 2, and so on. See the section Rows on this page to learn more about this.
When you left-pad the integers with zeros, their natural ordering is kept intact under lexicographic sorting, and that's why the scan order matches the order in which you inserted the data. To do that, you can design your rowkeys as suggested by @shutty.
I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
There are some general guidelines to follow in order to devise a good design:
Keep the rowkey as small as possible.
Avoid using monotonically increasing rowkeys, such as timestamps. This is a poor schema design and leads to RegionServer hotspotting. If you can't avoid it, use some technique, like hashing or salting, to avoid hotspotting.
Avoid using Strings as rowkeys if possible. The string representation of a number takes more bytes than its integer or long representation. For example: a long is 8 bytes, and you can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String, presuming a byte per character, you would need nearly 3x the bytes.
Use some mechanism, like hashing, to get a uniform distribution of rows in case your regions are not evenly loaded. You could also create pre-split tables to achieve this.
See this link for more on rowkey design.
HTH
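To see the byte-wise comparison concretely, here is a tiny, self-contained illustration using the HBase client's Bytes utility (the "p" prefix and class name are just placeholders):

```java
import org.apache.hadoop.hbase.util.Bytes;

// HBase compares rowkeys byte by byte, so "p~10" sorts before "p~2",
// while zero-padded keys keep the numeric order.
public class RowKeyOrdering {
    public static void main(String[] args) {
        // Unpadded: '1' (0x31) < '2' (0x32), so "p~10" < "p~2" lexicographically.
        System.out.println(Bytes.compareTo(Bytes.toBytes("p~10"), Bytes.toBytes("p~2")) < 0);    // true

        // Zero-padded to a fixed width: numeric order and byte order agree.
        System.out.println(Bytes.compareTo(Bytes.toBytes("p~0002"), Bytes.toBytes("p~0010")) < 0); // true
    }
}
```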
HBase stores rowkeys in lexicographical order, so you can try to use this schema with fixed-length rowkeys:
<prefix>~0001
<prefix>~0002
<prefix>~0003
...
<prefix>~0009
<prefix>~0010
Keep in mind that you should also use random prefixes to avoid region hotspotting (when a single region accepts most of the writes while the other regions are idle).
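One possible way to build such keys (the helper below is not from any library, just a sketch): a fixed-width, zero-padded counter behind a small bucket prefix derived from the counter, so that writes rotate over several regions instead of hammering one.

```java
// Hypothetical helper: zero-padded counter plus a cheap "salt" bucket.
// Pad width and bucket count are assumptions to tune per table.
public final class PaddedRowKey {
    private static final int BUCKETS = 8;   // assumed; align with your region count

    static String rowKey(String prefix, long counter) {
        long bucket = Math.floorMod(counter, (long) BUCKETS);  // round-robins writes over buckets
        // %04d only covers counters up to 9999; widen the pad to the largest expected value.
        return String.format("%d_%s~%04d", bucket, prefix, counter);
    }
}
```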
Monotonically increasing keys aren't a good schema for HBase.
You can read more here:
http://hbase.apache.org/book/rowkey.design.html
There is also a link there to OpenTSDB, which solves this problem.
Fixed-length keys are really recommended if possible. Bytes.toBytes(long value) can be used to get a byte array from a counter. It will sort well for positive longs less than Long.MAX_VALUE.
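A sketch of that binary variant, with an assumed one-byte salt in front to spread a monotonically increasing counter across regions (reads then have to fan out over the salt buckets):

```java
import org.apache.hadoop.hbase.util.Bytes;

// Bytes.toBytes(long) always yields 8 bytes, so keys stay fixed-length and, for
// non-negative longs, byte order matches numeric order. The salt byte and bucket
// count are assumptions, not part of the original answer.
public final class BinaryRowKey {
    private static final int SALT_BUCKETS = 16;   // assumed; align with pre-split regions

    static byte[] rowKey(long counter) {
        byte salt = (byte) Math.floorMod(counter, (long) SALT_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(counter));
    }
}
```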

What hash algorithms are most suitable for generating unique IDs in Postgres?

I have a large geospatial data set (~30m records) which I am currently importing into a PostgreSQL database. I need a unique ID to assign to each record, but an incrementing integer might be a bad idea because it could not be reliably recreated if I ever needed to reimport the data set.
It seems that a unique hash of the geometry data in a given projection might be the best option for a reliable identifier. Being able to calculate the hash within Postgres would be beneficial, and speed would also be a plus.
What is/are my options given this situation? Is there a particular method that is highly suitable for this situation?
If you need a unique identifier that depends on (and can be recreated from) the data, the most straightforward option seems to be an MD5 hash, which is included in PostgreSQL (no need for additional libraries) and is quite efficient and, for this scenario, secure.
The pgcrypto module provides additional hashing algorithms, e.g. SHA1.
Of course, you need to ensure that the data being hashed is unique.
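If the identifier also has to be recomputed outside the database, for example during the import itself, Java's built-in MessageDigest produces the same digest as Postgres' md5() as long as both sides hash exactly the same bytes (say, the canonical WKT or hex EWKB of the geometry in the chosen projection). A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// MD5 as a lowercase hex string, matching what SELECT md5('...') returns in
// Postgres when both sides are fed the same UTF-8 bytes.
public final class Md5Id {
    static String md5Hex(String canonicalForm) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```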

Find global subscript midpoint

In Caché ObjectScript (InterSystems' dialect of MUMPS), is there a way to efficiently skip to the approximate midpoint, or to evenly spaced points, of the keys in a global subscript range? Equal, that is, based on the number of records.
I want to divide up the subscript key range into approximately equal chunks and then process each chunk in parallel.
Knowing that the keys in a global are arranged in a binary tree of some kind, this should be a simple operation for the underlying data storage engine but I'm not sure if there is an interface to do this.
I can do it by scanning the global's whole keyspace but that would defeat the purpose of trying to run the operation in parallel. A sequential scan takes hours on this global. I need the keyspace divided up BEFORE I begin scanning.
I want each thread to take an approximately equal-sized, contiguous chunk of the keyspace to scan individually; the problem is calculating what key range to give each thread.
You can use the second parameter, "direction" (1 or -1), of the $order or $query function.
For my particular need, I found that the application I'm using has what I would call an index global: another global, maintained by the app with different keys, that links back to the main table. I can scan that in a fraction of the time and break up the keyset from there.
If someone comes up with a way to do what I want given only the main global, I'll change the accepted answer to that.