I've been doing some reading about hash tables recently. Different variants have different worst-case lookup times; in theory, the ideal is constant O(1) lookup. I want to understand hash tables better, so I'm asking:
What are some well-known, modern hash table algorithms (e.g., cuckoo hashing)?
What are their worst-case lookup times?
What is currently the most popular hash table in the database indexing area?
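To make the cuckoo example concrete, here is a minimal sketch (the table layout and second hash function are illustrative assumptions, not a production implementation). The point is that every key can live in only one of two candidate slots, so a lookup probes at most two slots and worst-case lookup is O(1):

```python
# Minimal cuckoo hashing sketch (illustrative assumptions throughout).
# Each key can live in exactly one of two candidate slots, one per table,
# so lookup probes at most two slots: O(1) worst case.

class CuckooTable:
    def __init__(self, size=1024):
        self.size = size
        self.table1 = [None] * size  # candidate slots under h1
        self.table2 = [None] * size  # candidate slots under h2

    def _h1(self, key):
        return hash(key) % self.size

    def _h2(self, key):
        # A second, roughly independent hash (illustrative choice).
        return hash((key, 0x9E3779B9)) % self.size

    def lookup(self, key):
        # Worst-case O(1): at most two probes.
        for table, h in ((self.table1, self._h1), (self.table2, self._h2)):
            entry = table[h(key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value, max_kicks=32):
        # Amortized O(1); assumes the key is not already present.
        # Displaced entries are "kicked" to their other candidate slot;
        # a real implementation rehashes/grows when max_kicks is exceeded.
        entry = (key, value)
        for _ in range(max_kicks):
            i = self._h1(entry[0])
            self.table1[i], entry = entry, self.table1[i]
            if entry is None:
                return
            j = self._h2(entry[0])
            self.table2[j], entry = entry, self.table2[j]
            if entry is None:
                return
        raise RuntimeError("eviction cycle; table needs rehashing")
```

Chaining-based tables, by contrast, have expected O(1) lookups but degrade to O(n) in the worst case (or O(log n) with tree buckets).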
I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of surrogate keys (sk_article) that are used to probe the main fact table. So it makes sense to set sort keys on the fields commonly used in the WHERE clause of that dimension table.
Next, and here's what I can't find an example or answer for: should I include sk_article in a sort key on the fact table? More specifically, should I create an interleaved sort key with all the various surrogate keys, since we don't always use the same ones to join to the fact table?
I have seen no reference to including sort keys solely for use in joins.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Amazon Redshift Foreign Keys - Sort or Interleaved Keys
Redshift Sort Key
Sort keys are for sorting purposes, not for joining. Multiple columns can be defined as sort keys; data stored in the table is sorted by those columns, and the query optimizer uses that sort order when determining optimal query plans.
Also, as Tony commented,
Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.
Each of those key types has a specific purpose. This may be a good read for you.
For joining fact and dimension tables, you should be using a distribution key.
Redshift Distribution Keys (DIST Keys)
It determines where data is stored in Redshift. A cluster stores data across its compute nodes, and query performance suffers when a large amount of data ends up on a single node. Here is a good read for you.
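For illustration, here is a hypothetical star schema (all table, column, and connection names are made up) that distributes the fact table on the join column and sorts it on a common filter column; psycopg2 also works against Redshift:

```python
# Hypothetical star schema DDL (all names made up), run via psycopg2.
import psycopg2

ddl = """
CREATE TABLE f_sales (
    sk_article INTEGER NOT NULL,
    sk_date    INTEGER NOT NULL,
    amount     DECIMAL(12, 2)
)
DISTKEY (sk_article)   -- join column: co-locates fact rows with d_article
SORTKEY (sk_date);     -- filter column: zone maps prune blocks on date ranges

CREATE TABLE d_article (
    sk_article INTEGER NOT NULL,
    category   VARCHAR(64)
)
DISTSTYLE ALL;         -- small dimension: replicated to every node
"""

with psycopg2.connect("dbname=mydb host=example-cluster user=me") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```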
I hope this answers your question.
A good video session on SORT vs. DIST keys is here; it may be really helpful.
I have a table with around 60M records, and it will potentially grow to ~500M soon (and then keep growing slowly). The table has a column, say category. The total number of categories is around 20K and grows very slowly and only occasionally. Records are not distributed evenly among categories: some categories cover 5% of all records, while others are represented by only a very small proportion of records.
I have a number of queries that work with only one or several categories (using = or IN/ANY conditions), and I want to optimize the performance of these queries.
Given the low-selectivity nature of the data in the column, which type of Postgres index would be more beneficial: hash or B-tree?
Are there any other ways to optimize performance of these queries?
I can only give a generalized answer to this broad question.
Use B-tree indexes, not hash indexes.
If you have several conditions that are not very selective, create an index on each of the columns; Postgres can then combine them with a bitmap index scan, as in the sketch below.
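For instance (hypothetical table and column names), with one b-tree index per filtered column, the planner can AND the two together:

```python
# Hypothetical example: two single-column b-tree indexes that Postgres
# can combine with a bitmap index scan (BitmapAnd) when a query filters
# on both columns.
import psycopg2

with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX ON items (category)")
    cur.execute("CREATE INDEX ON items (status)")
    # EXPLAIN shows whether the planner chose to combine both indexes.
    cur.execute(
        "EXPLAIN SELECT * FROM items WHERE category = ANY(%s) AND status = %s",
        ([3, 17, 42], "active"),
    )
    for (line,) in cur.fetchall():
        print(line)
```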
In general, a column that is not very selective is not a good candidate for an index. Indexes are not free: they need to be maintained, and at query time, in most cases, Postgres will still have to go out to the table for each row the index search matches (the exception being covering indexes).
With that said, I'm not sure about your selectivity analysis. If the highest percentage you'll filter down to in the worst case is 5%, and most categories are far below that, then I'd say you have a very selective column.
As for which index type to use, b-tree versus hash, I generally go with a b-tree index as my standard unless there is a specific need to deviate.
Hash indexes are faster to query than b-tree indexes, but they cannot be used for range lookups, only equality. Hash indexes are not supported on all RDBMSs and, as a result, are less well understood in the community, which can hinder support.
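In Postgres terms (hypothetical names again), the default CREATE INDEX gives you a b-tree; a hash index has to be requested explicitly and only supports equality:

```python
# B-tree is the default; a hash index must be requested explicitly and
# supports only equality (no <, >, BETWEEN, ORDER BY).
import psycopg2

with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX items_category_btree ON items (category)")
    cur.execute("CREATE INDEX items_category_hash ON items USING HASH (category)")
```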
This is the controversial line from Cracking the Coding Interview on hash tables.
Another common implementation (besides linked-list) for a hash table is to use a BST as the underlying data structure.
I know this question has been asked before... it's confusing because people keep giving two different answers. For example:
Why implement a Hashtable with a Binary Search Tree?
The highest-voted answer in that post says the quoted statement is talking about a hash table implemented with a binary search tree, without an underlying array. I understood this as: since each inserted element gets a hash value (an integer), the elements form a total order (every pair can be compared with < and >). Therefore, we can simply use a binary search tree to hold the elements of the hash table.
On the other hand, others say
Hash table - implementing with Binary Search Tree
the book is saying that we should handle collisions with a binary search tree. So there is an underlying array, and when collisions occur because multiple elements get the same hash value and are placed in the same slot of the array, that's where the BST comes in.
So each slot in the array will be a pointer to a BST, which holds elements with the same hash value.
I'm leaning towards the second post's argument, because the first post does not really explain how such an implementation of a hash table can handle collisions. And I don't think it can achieve expected O(1) insert/delete/lookup time.
But for the second post: if multiple elements get the same hash value and are placed in a BST, I'm not sure how those elements are ordered (how can they be compared against each other?).
Please, help me put an end to this question once and for all!
the first post does not really explain how such an implementation of a hash table can handle collisions
With a BST, you can use a hashing function that produces no duplicate keys, so there are no collisions. The advantage here isn't speed but reduced memory consumption and better worst-case guarantees. If you're writing software for a critical real-time system, you might not be able to tolerate an O(n) resizing of your hash table.
if multiple elements get the same hash value and are placed in a BST, I'm not sure how those elements are ordered (how can they be compared against each other?)
Rehash with another function.
In the end, it all depends on what your data structure is used for (is memory or speed more important? is amortized or worst-case performance more important? and so on).
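To make the second interpretation concrete, here is a minimal sketch (hypothetical, not from the book) of an array of buckets where each bucket is a BST. Inside a bucket, nodes are ordered by the keys themselves, which assumes the keys are comparable; ordering by a second hash function, as suggested above, works the same way:

```python
# Interpretation 2: an array of buckets, each bucket a BST of the
# colliding entries instead of a linked list.

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def bst_insert(node, key, value):
    # Standard (unbalanced) BST insert; a balanced tree would
    # guarantee O(log k) per bucket of k colliding keys.
    if node is None:
        return Node(key, value)
    if key == node.key:
        node.value = value
    elif key < node.key:
        node.left = bst_insert(node.left, key, value)
    else:
        node.right = bst_insert(node.right, key, value)
    return node

def bst_find(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

class BSTHashTable:
    # Entries within a bucket are ordered by the keys themselves,
    # which assumes keys support < (alternatively, order by a second hash).
    def __init__(self, nbuckets=16):
        self.buckets = [None] * nbuckets  # each slot is the root of a BST

    def put(self, key, value):
        i = hash(key) % len(self.buckets)
        self.buckets[i] = bst_insert(self.buckets[i], key, value)

    def get(self, key):
        return bst_find(self.buckets[hash(key) % len(self.buckets)], key)
```

With a decent hash function each bucket stays small, so lookups remain expected O(1), with an O(log k) bound inside a bucket of k colliding keys. Interpretation 1 would instead drop the array entirely and keep one big BST ordered by hash value, trading lookup speed (O(log n)) for lower memory use and the worst-case guarantees described above.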
I know PostgreSQL discourages using hash indexes. The docs actually say:
"Caution Hash index operations are not presently WAL-logged, so hash
indexes might need to be rebuilt with REINDEX after a database crash.
They are also not replicated over streaming or file-based replication.
For these reasons, hash index use is presently discouraged."
This is a good argument not to use them at all, but I can't understand why the PostgreSQL developers don't make the effort to turn hash indexes into first-class citizens and encourage their use in appropriate situations, rather than discouraging it altogether.
Actually, if you only need to search for equality, hash indexes should be far superior to any kind of tree, since they do search, insertion, and deletion in O(1), and balanced trees by nature can't do better than O(log n). In the worst case a hash index can degrade to O(n), but there is a bunch of well-known techniques for avoiding the worst case. If I were a DB engine architect, such an argument would definitely drive my decision to make hash indexes a viable alternative, but with PostgreSQL it seems different. Is there a technical reason for this, or is the decision not technically motivated?
Tree indexes, for instance B+-trees and their variants, are so efficient that they are considered to have a cost of O(c), where c, the height of the tree, is a small constant (with c = 3 or 4 you can index millions of records). Usually at least one or two levels of such a tree are cached, so the number of disk accesses is 1 or 2 in most cases.
So, for practical purposes, their performance is similar to that of hash indexes, and, moreover, they have the enormous advantage of supporting range searches.
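A quick back-of-the-envelope check of the height claim (the fanout of ~500 keys per page is an assumed, though typical, value):

```python
# A B+-tree of height c with fanout f indexes roughly f**c records.
fanout = 500
for height in (2, 3, 4):
    print(height, fanout ** height)  # 250_000, 125_000_000, 62_500_000_000
```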
Let's assume a system where we have about 10 million users, and we need to cache those user objects in Redis after retrieving them from the database.
Now the question is: should we store those JSON objects as individual key-value pairs, with keys like "user_1", or would a more appropriate solution be to put them all into a single hash "users", with the hash field being the user ID ("1" in this case)?
I assume individual key-value pairs would take much more memory than a hash, but what about performance?
Since both the global key space and hashes are hash tables, access time is O(1). Performance shouldn't be an issue in either case.
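For illustration, here are both layouts side by side with the redis-py client (the user data is made up):

```python
# Both layouts side by side (hypothetical data, redis-py client).
# Either way, a single-user lookup is an O(1) hash table access.
import json
import redis

r = redis.Redis()
payload = json.dumps({"id": 1, "name": "Alice"})

# Layout 1: one top-level key per user.
r.set("user_1", payload)
print(r.get("user_1"))

# Layout 2: one hash with one field per user.
r.hset("users", "1", payload)
print(r.hget("users", "1"))
```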
BTW, I would take a look at the official Redis docs' article about memory optimization. Its first paragraph states:
Since Redis 2.2 many data types are optimized to use less space up to a certain size. Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory (with 5 time less memory used being the average saving).
Also, you said:
we have about 10 million users.
Then, whether you use the global key space or hashes, you should take a look at sharding with Redis Cluster. That way you can probably optimize your scenario even further.
Three years late, but given @Matias' comment on sharding with Redis Cluster, it is worth noting that the unit of sharding is the key name. That means that all values in a hash will end up on the same server. So for millions of users, the global key space would allow sharding, but a single hash would not.