I know PostgreSQL discourages using hash indices. The documentation actually says:
"Caution Hash index operations are not presently WAL-logged, so hash
indexes might need to be rebuilt with REINDEX after a database crash.
They are also not replicated over streaming or file-based replication.
For these reasons, hash index use is presently discouraged."
This is a good argument against using them at all, but I can't understand why the PostgreSQL developers don't make the effort to turn hash indices into first-class citizens and encourage their use in certain situations, rather than discouraging them altogether.
Actually, if you only need to search for equality, hash indices should be far superior to any kind of tree, since they do search, insertion and deletion in O(1) on average, while balanced trees naturally can't do better than O(log(n)). In the worst case hash indices can degrade to O(n), but there are a number of known techniques to avoid the worst case. If I were a DB engine architect, such an argument would definitely drive my decision to make hash indices a viable alternative, but with PostgreSQL it seems different. Is there a technical reason for this, or is the decision not technically motivated?
Tree indexes, for instance B+-trees and their variants, are so efficient that their cost is usually considered to be O(c), where c, the height of the tree, is a small constant (with c = 3 or 4 you can index millions of records). Usually at least one or two levels of such a tree are cached, so the number of disk accesses is 1 or 2 in most cases.
So, for practical purposes, their performance is similar to that of hash indexes and, moreover, they have the enormous advantage of allowing range searches.
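To put rough numbers on that argument, here is a minimal Python sketch. The fan-out of 200 children per node and the two cached levels are assumptions chosen for illustration, not figures from any particular engine; real values depend on page size and key width.

```python
def max_indexed_keys(fanout: int, height: int) -> int:
    """Upper bound on the keys reachable through a tree of the given height."""
    return fanout ** height


def disk_reads_per_lookup(height: int, cached_levels: int) -> int:
    """Levels that still have to be read from disk when the top levels are cached."""
    return max(height - cached_levels, 0)


FANOUT = 200        # assumed children per node (illustrative only)
CACHED_LEVELS = 2   # assumed number of top levels kept in memory

for height in (3, 4):
    keys = max_indexed_keys(FANOUT, height)
    reads = disk_reads_per_lookup(height, CACHED_LEVELS)
    print(f"height={height}: up to {keys:,} keys, about {reads} disk read(s) per lookup")
```

Under these assumptions a tree of height 3 covers up to 8 million keys and one of height 4 covers up to 1.6 billion, with one or two uncached levels per lookup.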
It seems that the more compound indexes I add to my collection, the better it gets up to some point, and beyond that, the more indexes I add, the slower it becomes.
Is this possible? If so why?
EDITED:
I am referring to read queries, not write queries. I am aware that writes will be slower.
This is the case for any sort of index, not just compound indexes.
In MongoDB (and most databases) a lot of operations are sped up by having an index, at the cost of maintaining each index.
Generally speaking this shouldn't slow down things like a find, but it will very much affect insert and update, as those change the underlying data and thus require modifying or rebuilding each index those changes are linked to.
However, even with inserts and updates an index can help speed up those operations as the query engine can find the documents to update quicker.
In the end it is very much a balance, as the cost of maintaining the indexes and the space they take up can, if you are overzealous (i.e. creating many rarely used indexes), counteract their helpfulness.
For a deeper dive into that, I'd suggest these docs:
https://www.mongodb.com/docs/manual/core/data-model-operations/#std-label-data-model-indexes
https://www.mongodb.com/docs/manual/core/index-creation/
I agree with the information that @Justin Jenkins shared in their answer, as there is absolutely write overhead associated with maintaining indexes. I don't think that answer focuses much on query performance, though, which is what I understand this question to be about. I will give some thoughts about that below, though without additional details about the situation it will necessarily be a little generic.
Although indexes absolutely feel magical at times, they are still just a utility that we make available for the database to use when running operations. Ideally it would never be the case that adding an index would slow down the execution of a query, but unfortunately it can in some circumstances. This is not particularly common which is why it is not often an upfront talking point or concern.
Here are some important considerations:
The database is responsible for figuring out the index(es) that would result in the most efficient execution plan for every arbitrary query that is executed.
Indexes are data structures. They take up space in memory when loaded from disk and must be traversed to be read.
The server hosting the database only has finite resources. Every time it uses some of those resources to maintain indexes it reduces the amount of resources available to process queries. It also introduces more possibilities for locking, yielding, or other contention to maintain consistency.
If you are observing a sudden and drastic degradation in query performance, I would tend to suspect a problem associated with the first consideration above. Again, while not particularly common, it is possible that the increased number of indexes is now preventing the database from finding the optimal plan. This would be most likely if the query contained an $or operator, but can happen in other situations as well. Be on the lookout for a different index being reported in the winningPlan of the explain output for the query. It would usually happen after a specific number of indexes were created and/or if the new index(es) had a particular definition relevant to the query of interest.
A slower and more linear degradation in performance would seem to be for a different reason, such as the second or third items mentioned above. While memory/cache contention can certainly still degrade performance reasonably quickly, you would not see a shift in the query plans with one of these problems. What can happen here instead is that you now have two indexes which (for simplicity) take up twice the amount of space and compete for the same limited space in memory. If what is requested exceeds what is available, then the database will have to begin reading useful portions of the indexes (and data) into and out of its cache. This overhead can quickly add up and will result in operations spending more time waiting for their portion of the index to be made available in memory for reading. I would expect a broader portion of queries to be impacted, though more moderately, in this situation.
In any case, the most actionable broad advice we can give would be for you to review and consolidate your existing indexes. There is a little bit of guidance on the topic here in the documentation. The general idea is that the prefix of the index (the keys at the beginning) is the part that matters most when it comes to usage for queries. Except for a few special circumstances, a single field index on { A: 1 } is completely redundant if you have a separate compound index on { A: 1, B: 1 }. Since the latter index can support all of the operations that the former one can, the former one (the single field index in this example) should be removed.
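The prefix rule can be sketched as a small review helper. This is a standalone Python illustration, not a MongoDB API: the index names and the find_redundant function are hypothetical, and the check deliberately ignores index options (uniqueness constraints, for example), so its output should be treated as a list of candidates to review rather than indexes that are safe to drop.

```python
from typing import Dict, List, Tuple

# An index definition as an ordered list of (field, direction) pairs.
IndexKeys = List[Tuple[str, int]]


def is_strict_prefix(shorter: IndexKeys, longer: IndexKeys) -> bool:
    """True if `shorter` equals the leading keys of `longer` and is not the same index."""
    return len(shorter) < len(longer) and longer[: len(shorter)] == shorter


def find_redundant(indexes: Dict[str, IndexKeys]) -> Dict[str, str]:
    """Map each index that is a prefix of another index to the index that covers it."""
    redundant: Dict[str, str] = {}
    for name, keys in indexes.items():
        for other_name, other_keys in indexes.items():
            if name != other_name and is_strict_prefix(keys, other_keys):
                redundant[name] = other_name
                break
    return redundant


# Hypothetical index definitions mirroring the { A: 1 } vs { A: 1, B: 1 } example.
indexes = {
    "A_1": [("A", 1)],
    "A_1_B_1": [("A", 1), ("B", 1)],
    "B_1": [("B", 1)],
}
print(find_redundant(indexes))
```

Note that { B: 1 } is not flagged: B is not at the beginning of { A: 1, B: 1 }, so the compound index cannot stand in for it.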
Ultimately you may have to make some tradeoffs about which indexes to maintain, and there may not be a 'perfect' index present for every single query. That's okay. Sometimes it is better to let one query do a little extra scanning when one of its predicate fields is not indexed as opposed to maintaining an entirely separate index. There is a tradeoff here at some point and, as @Justin Jenkins put it, it's important not to go too far and become overzealous when creating indexes.
In our DB we have a large text field which we want to filter on an exists/does-not-exist basis, so we don't need to perform any text search on it.
We assume that an index would help, but there is no guarantee that the field won't exceed 1024 bytes, so an ordinary index is not an option.
Does a hashed index on such a field support $exists-filtering queries?
Do hashed indexes impose any field-size limitations? (In our experiments, a hashed index is well capable of indexing fields where an ordinary index fails.) We haven't found any explicit statement on this in the docs, though.
Is the chosen approach as a whole the correct one?
Yes, your approach is the correct one given the constraints. However, there are some caveats.
The performance advantage of an index over a collection scan is limited by the available RAM, since mongod tries to keep indexes in RAM. If it can't (due to queries, for example), even an index will be read from disk, more or less eliminating the performance advantage of using it. So you should test whether the additional index pushes the RAM needed beyond the limits of your planned deployment.
The other, more severe problem is that you cannot use said index to reliably tell distinct documents apart, since hashes are not guaranteed to be unique. Albeit a bit theoretical, you have to keep that in mind.
I really like MongoDB, I use it at work and home, and not once yet have I hit a performance, complexity, or limitation issue with it. But I've been thinking about indexes a lot and I had a question I've not found an adequate answer to.
One of the big issues with SQL databases at scale is the relative complexity of queries. Specifically, MySQL uses B-trees for most of its indexes, so a lookup takes O(log(n)): better than linear, but it still means things take longer the more data you have.
A big attraction of noSQL databases is the removal/mitigation of this scaling issue, often relying instead on hash style indexes, which have O(1) lookup time, so having more data doesn't slow down your app. This is where my question comes in:
According to the official MongoDB documentation, all indexes in Mongo use b-trees. Despite the fact that Mongo does in fact have a hashed index, as far as I can tell these are still stored in b-trees, same with the index on the _id field. I couldn't even find anything indicating anything about constant time anywhere in Mongo's documentation!
So my question is this: are, in fact, all indexes (including _id and hashed) in Mongo stored in b-trees? Does this mean querying for keys (even by _id) in fact takes O(log(n)) time?
Addendum: As a point of note, it'd be great if the Mongo documentation provided some complexity formulas with example queries. My favorite example of this is the Redis documentation.
Also: This is related. But I have the added specific questions regarding the hashed indexes and (more importantly) the _id index.
If you look at the code for indexing in mongodb (here), you can easily see that it's using btree for indexing. So the order of the algorithm is O(log n), but the base of this logarithm function is not 2, but 8192 instead, which is here in the code.
So for a million records we only have two levels (assuming the tree is balanced) and that is why it can find the record so fast. Overall, it's true the order is logarithmic, but since the base of the logarithm function is so large, it grows slowly.
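The arithmetic behind that claim is easy to check. The Python sketch below takes the figure of 8192 at face value as the branching factor, which is an assumption: the real fan-out depends on key size and node layout, so treat the result as an order-of-magnitude illustration.

```python
def levels_needed(num_keys: int, branching_factor: int) -> int:
    """Smallest number of levels h with branching_factor ** h >= num_keys."""
    levels = 1
    capacity = branching_factor
    while capacity < num_keys:
        capacity *= branching_factor
        levels += 1
    return levels


ONE_MILLION = 1_000_000

# Base 8192 is the figure quoted above (assumed here to be the fan-out).
print(levels_needed(ONE_MILLION, 8192))
# Base 2 corresponds to a plain binary tree, for comparison.
print(levels_needed(ONE_MILLION, 2))
```

With base 8192 a million keys fit in 2 levels, against 20 levels for a binary tree, and even a billion keys would need only 3.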
Based on your experience, is there any practical limit on the number of indexes per table in PostgreSQL? In theory there is not, as per the documentation: "Maximum Indexes per Table Unlimited". But:
Is it the case that the more indexes you have, the slower the queries? Does it make a difference if I have tens vs. hundreds or even thousands of indexes? I am asking after reading the documentation on Postgres' partial indexes, which makes me think of some very creative solutions that, however, require a lot of indexes.
There is overhead in having a high number of indexes in a few different ways:
Space consumption, although this would be lower with partial indexes of course.
Query optimisation, through making the choice of optimiser plan potentially more complex.
Table modification time, through the additional work in modifying indexes when a new row is inserted, or current row deleted or modified.
I tend by default to go heavy on indexing as:
Space is generally pretty cheap
Queries with bound variables only need to be optimised once
Rows generally have to be found much more often than they are modified, so it's generally more important to design the system for efficiently finding rows than it is for reducing overhead in making modifications to them.
The impact of missing a required index can be very high, even if the index is only required occasionally.
I've worked on an Oracle system with denormalised reporting tables having over 200 columns with 100 of them indexed, and it was not a problem. Partial indexes would have been nice, but Oracle does not support them directly (you use a rather inconvenient CASE hack).
So I'd go ahead and get creative, as long as you're aware of the pros and cons, and preferably you would also measure the impact that you're having on the system.
I have been working on optimizing my Postgres databases recently, and traditionally I've only ever used B-Tree indexes. However, I saw in the Postgres 8.3 documentation that GiST indexes support non-unique, multicolumn indexes.
I couldn't, however, see what the actual difference between them is. I was hoping that my fellow coders might be able to explain what the pros and cons of each are and, more importantly, the reasons why I would use one over the other.
In a nutshell: B-Tree indexes perform better, but GiST indexes are more flexible. Usually, you want B-Tree indexes if they'll work for your data type. There was a recent post on the PG lists about a huge performance hit for using GiST indexes; they're expected to be slower than B-Trees (such is the price of flexibility), but not that much slower... work is, as you might expect, ongoing.
From a post by Tom Lane, a core PostgreSQL developer:
The main point of GIST is to be able to index queries that simply are
not indexable in btree. ... One would fully
expect btree to beat out GIST for btree-indexable cases. I think the
significant point here is that it's winning by a factor of a couple
hundred; that's pretty awful, and might point to some implementation
problem.
Basically everybody's right - btree is the default index as it performs very well. GiST indexes are somewhat different beasts - GiST is more of a "framework to write index types" than an index type on its own. You have to add custom code (in the server) to use it, but on the other hand it is very flexible.
Generally - you don't use GiST unless the datatype you're using tells you to do so. Examples of datatypes that use GiST: ltree (from contrib), tsvector (contrib/tsearch till 8.2, in core since 8.3), and others.
There is a well-known and pretty fast geographic extension to PostgreSQL - PostGIS (http://postgis.refractions.net/) - which uses GiST for its purposes.
GiST indexes are more general. You can use them for broader purposes than the ones you would use a B-Tree for, including building a B-Tree using GiST.
For example, you can use GiST to index geographical points or geographical areas, something you won't be able to do with B-Tree indexes, since the only thing that matters in a B-Tree is the key (or keys) you are indexing on.
GiST indexes are lossy to an extent, meaning that the DBMS has to deal with false positives by rechecking the actual rows. For text search, for example:
GiST indexes are lossy because each document is represented in the index by a fixed-
length signature. The signature is
generated by hashing each word into a
random bit in an n-bit string, with
all these bits OR-ed together to
produce an n-bit document signature.
When two words hash to the same bit
position there will be a false match.
If all words in the query have matches
(real or false) then the table row
must be retrieved to see if the match
is correct.
B-trees do not have this behavior, so depending on the data being indexed, there may be some performance difference between the two.
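The signature scheme described in that quote can be sketched in a few lines of Python. This is an illustration of the idea only, not PostgreSQL's implementation: the SHA-256-based word_bit function, the signature width, and the search helper are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List


def word_bit(word: str, n_bits: int) -> int:
    """Map a word to one bit position in an n-bit signature (hash choice is assumed)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def signature(words: Iterable[str], n_bits: int) -> int:
    """OR together the bits of all words to form the document signature."""
    sig = 0
    for word in words:
        sig |= 1 << word_bit(word, n_bits)
    return sig


def may_match(doc_sig: int, query_words: Iterable[str], n_bits: int) -> bool:
    """Index-level test: every query bit is set. Can be a false match, never a false miss."""
    query_sig = signature(query_words, n_bits)
    return (doc_sig & query_sig) == query_sig


def search(docs: List[List[str]], query_words: List[str], n_bits: int) -> List[int]:
    """Filter by signature first, then recheck the real document to drop false matches."""
    candidates = [
        i for i, doc in enumerate(docs)
        if may_match(signature(doc, n_bits), query_words, n_bits)
    ]
    return [i for i in candidates if set(query_words) <= set(docs[i])]


docs = [["cat", "sat"], ["dog"]]
# With a 1-bit signature every word collides, so the index alone cannot tell them apart.
print(may_match(signature(docs[1], 1), ["cat"], 1))
# The recheck against the actual document removes the false match.
print(search(docs, ["cat"], 1))
```

The deliberately tiny 1-bit signature forces a collision so the false match is visible; a wider signature makes collisions rarer but never impossible, which is why the recheck step is always needed.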
For text search behavior, see http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html, and for a general-purpose comparison, see http://www.postgresql.org/docs/8.3/static/indexes-types.html.