I have to decide whether to use GIN or GiST indexing for an hstore column.
The Postgres docs state:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are about ten times slower to update than GiST
GIN indexes are two-to-three times larger than GiST
The way I interpret it, use GIN if you need to query a lot, use GiST if you need to update a lot.
In this test, all three disadvantages of GIN relative to GiST mentioned above are confirmed. However, contrary to what the Postgres docs suggest, the advantage of GIN over GiST (faster lookup) is very small. Slide 53 shows that in the test GIN was only 2% to 3% faster, as opposed to the 200% to 300% suggested in the Postgres docs.
Which source of information is more reliable and why?
The documents state what the situation is "in general".
However, you aren't running PostgreSQL "in general", you are running it on specific hardware with a specific pattern of use.
So - if you care a lot, then you'll want to test it yourself. A GiST index will always require re-checking its condition. However if the queries you run end up doing further checks anyway, a GIN index might not win there. Also there are all the usual issues around cache usage etc.
For my usage, on smaller databases with moderate update rates, I've been happy enough with GiST. I've seen a 50% improvement in speed with GIN (across a whole query), but it's not been worth the slower indexing. If I was building a huge archive server it might be different.
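A back-of-the-envelope way to reason about the trade-off is to plug the ratios into a small cost model. The sketch below is only illustrative: it assumes a single GiST lookup and a single GiST update cost the same (one unit each), which is an assumption and not something the docs or the test state, and the function names are made up for this example.

```python
def break_even_lookups_per_update(lookup_speedup=3.0, update_slowdown=10.0):
    """Lookups per update above which GIN has the lower total cost.

    Assumes (hypothetically) that one GiST lookup and one GiST update
    each cost 1 unit; GIN lookups cost 1/lookup_speedup and GIN updates
    cost update_slowdown.
    """
    # Solve: n_l / s + n_u * u = n_l + n_u  for n_l / n_u
    return (update_slowdown - 1.0) * lookup_speedup / (lookup_speedup - 1.0)


def cheaper_index(lookups, updates, lookup_speedup=3.0, update_slowdown=10.0):
    """Return "GIN" or "GiST", whichever has the lower modelled cost."""
    gist_cost = lookups * 1.0 + updates * 1.0
    gin_cost = lookups / lookup_speedup + updates * update_slowdown
    return "GIN" if gin_cost < gist_cost else "GiST"


if __name__ == "__main__":
    # Ratios from the docs: 3x faster lookups, 10x slower updates.
    print(break_even_lookups_per_update(3.0, 10.0))
    # Ratio from the test's slide 53: lookups only ~3% faster.
    print(break_even_lookups_per_update(1.03, 10.0))
    print(cheaper_index(lookups=1000, updates=10))
    print(cheaper_index(lookups=100, updates=100))
```

Under these assumptions the docs' figures put the break-even at 13.5 lookups per update, while the 3% figure from the test pushes it to roughly 309, which is why the two sources lead to such different advice.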
Related
I really like MongoDB, I use it at work and home, and not once yet have I hit a performance, complexity, or limitation issue with it. But I've been thinking about indexes a lot and I had a question I've not found an adequate answer to.
One of the big issues with SQL databases at scale is the relative complexity of queries. Specifically, MySQL uses b-trees for most of its indexes, so a lookup takes O(log(n)) time: better than linear, but it still means things take longer the more data you have.
A big attraction of noSQL databases is the removal/mitigation of this scaling issue, often relying instead on hash style indexes, which have O(1) lookup time, so having more data doesn't slow down your app. This is where my question comes in:
According to the official MongoDB documentation, all indexes in Mongo use b-trees. Although Mongo does have a hashed index, as far as I can tell these are still stored in b-trees, and the same goes for the index on the _id field. I couldn't find anything about constant time anywhere in Mongo's documentation!
So my question is this: are, in fact, all indexes (including _id and hashed) in Mongo stored in b-trees? Does this mean querying for keys (even by _id) in fact takes O(log(n)) time?
Addendum: As a point of note, it'd be great if the Mongo documentation provided some complexity formulas with example queries. My favorite example of this is the Redis documentation.
Also: This is related. But I have the added specific questions regarding the hashed indexes and (more importantly) the _id index.
If you look at the code for indexing in MongoDB, you can easily see that it uses a B-tree for indexing. So the order of the algorithm is O(log n), but the base of the logarithm is not 2 but 8192, a value set in the code.
So for a million records we only have two levels (assuming the tree is balanced) and that is why it can find the record so fast. Overall, it's true the order is logarithmic, but since the base of the logarithm function is so large, it grows slowly.
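The arithmetic behind that claim can be checked with a few lines. This sketch follows the answer in treating 8192 as the number of keys per node (the fanout); that is the answer's simplification, and the real fanout depends on key size. The helper name is made up for this example.

```python
def btree_levels(n_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= n_keys.

    Uses integer arithmetic to avoid floating-point rounding in log().
    """
    levels = 1
    capacity = fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for fanout in (2, 8192):
        print(fanout, btree_levels(1_000_000, fanout))
```

With a fanout of 2, a million keys need 20 levels; with 8192 they need 2, and even a billion keys need only 3.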
I have a fragmentation problem on my production database. One of my main data tables is about 6GB (3GB indexes) in size, holds about 9M records, and has 94%(!) index fragmentation.
I know that reorganizing indexes will solve this problem, BUT my database is on SQL Server 2008 R2 Express, which has a 10GB database limit, and my database is already 8GB in size.
I have read a few blog posts about this issue but none gave an answer for my situation.
My Question1 is:
How much of a size increase (in % or GB) can I expect after reorganizing the indexes on that table?
Question2:
Will dropping the index and then building the same index again take less space? Time is not a factor for me at the moment.
Extra question:
Any other suggestions for database fragmentation? The only thing I know is to avoid shrinking like fire ;)
Having an index on key columns will improve joins and filters by removing the need for a table scan. A well-maintained index can drastically improve performance.
It is right that GUIDs make a poor choice for indexed columns, but that by no means implies you should not create these indexes. Ideally a data type of INT or BIGINT would be advisable.
For me, adding NEWSEQUENTIALID() as a default (rather than NEWID(), which generates random values) has shown some improvement in counteracting index fragmentation, but if all alternatives fail you may have to do index maintenance (rebuild, reorganize) more often than for other indexes. Reorganizing needs some working space, but since time is not a concern in your scenario, I would disable the index, shrink the database, and then recreate the index.
Based on your experience, is there any practical limit on the number of indexes per table in PostgreSQL? In theory there is not, as per the documentation: "Maximum Indexes per Table Unlimited". But:
Is it true that the more indexes you have, the slower the queries? Does it make a difference if I have tens vs hundreds or even thousands of indexes? I am asking after reading the documentation on Postgres' partial indexes, which makes me think of some very creative solutions that, however, require a lot of indexes.
There is overhead in having a high number of indexes in a few different ways:
Space consumption, although this would be lower with partial indexes of course.
Query optimisation, through making the choice of optimiser plan potentially more complex.
Table modification time, through the additional work in modifying indexes when a new row is inserted, or current row deleted or modified.
I tend by default to go heavy on indexing as:
Space is generally pretty cheap
Queries with bound variables only need to be optimised once
Rows generally have to be found much more often than they are modified, so it's generally more important to design the system for efficiently finding rows than it is for reducing overhead in making modifications to them.
The impact of missing a required index can be very high, even if the index is only required occasionally.
I've worked on an Oracle system with denormalised reporting tables having over 200 columns with 100 of them indexed, and it was not a problem. Partial indexes would have been nice, but Oracle does not support them directly (you use a rather inconvenient CASE hack).
So I'd go ahead and get creative, as long as you're aware of the pros and cons, and preferably you would also measure the impact that you're having on the system.
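One way to measure the table-modification overhead mentioned above is to time the same batch of inserts with different numbers of indexes. The sketch below uses SQLite purely because it ships with Python; that is a stand-in chosen for this example, so the absolute numbers will not transfer to PostgreSQL or Oracle. The table, column, and function names are all made up for illustration.

```python
import sqlite3
import time


def time_inserts(n_indexes, n_rows=2000, n_cols=8):
    """Insert n_rows into a fresh in-memory table carrying n_indexes indexes.

    Returns (elapsed_seconds, number_of_indexes_actually_present).
    """
    conn = sqlite3.connect(":memory:")
    cols = [f"c{i}" for i in range(n_cols)]
    conn.execute(f"CREATE TABLE t ({', '.join(c + ' INTEGER' for c in cols)})")
    # Spread the indexes over the columns; extra ones repeat a column.
    for i in range(n_indexes):
        conn.execute(f"CREATE INDEX idx_{i} ON t ({cols[i % n_cols]})")

    rows = [tuple((r * (c + 1)) % 1000 for c in range(n_cols))
            for r in range(n_rows)]
    placeholders = ", ".join("?" for _ in cols)

    start = time.perf_counter()
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    conn.commit()
    elapsed = time.perf_counter() - start

    index_count = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'index'"
    ).fetchone()[0]
    conn.close()
    return elapsed, index_count


if __name__ == "__main__":
    for n in (0, 4, 16, 32):
        elapsed, count = time_inserts(n)
        print(f"{count:3d} indexes: {elapsed:.4f}s for 2000 inserts")
```

Running the same loop against your real schema and a realistic mix of reads and writes is what actually tells you whether the extra indexes are worth it.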
I have been working on optimizing my Postgres databases recently, and traditionally I've only ever used B-Tree indexes. However, I saw in the Postgres 8.3 documentation that GiST indexes support non-unique, multicolumn indexes.
I couldn't, however, see what the actual difference between them is. I was hoping that my fellow coders might be able to explain what the pros and cons of each are and, more importantly, why I would use one over the other.
In a nutshell: B-Tree indexes perform better, but GiST indexes are more flexible. Usually, you want B-Tree indexes if they'll work for your data type. There was a recent post on the PG lists about a huge performance hit for using GiST indexes; they're expected to be slower than B-Trees (such is the price of flexibility), but not that much slower... work is, as you might expect, ongoing.
From a post by Tom Lane, a core PostgreSQL developer:
The main point of GIST is to be able to index queries that simply are not indexable in btree. ... One would fully expect btree to beat out GIST for btree-indexable cases. I think the significant point here is that it's winning by a factor of a couple hundred; that's pretty awful, and might point to some implementation problem.
Basically everybody's right - btree is the default index as it performs very well. GiST indexes are somewhat different beasts - GiST is more of a "framework to write index types" than an index type on its own. You have to add custom code (in the server) to use it, but on the other hand it is very flexible.
Generally - you don't use GiST unless the datatype you're using tells you to do so. Examples of datatypes that use GiST: ltree (from contrib), tsvector (contrib/tsearch till 8.2, in core since 8.3), and others.
There is a well-known and pretty fast geographic extension to PostgreSQL - PostGIS (http://postgis.refractions.net/) - which uses GiST for its purposes.
GiST indexes are more general. You can use them for broader purposes than the ones you would use a B-Tree for, including the ability to build a B-Tree using GiST.
E.g.: you can use GiST to index geographical points or geographical areas, something you won't be able to do with B-Tree indexes, since the only thing that matters in a B-Tree is the key (or keys) you are indexing on.
GiST indexes are lossy to an extent, meaning that the DBMS has to deal with false positives by re-checking the actual rows. For text search, the documentation explains:
GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature is generated by hashing each word into a random bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature. When two words hash to the same bit position there will be a false match. If all words in the query have matches (real or false) then the table row must be retrieved to see if the match is correct.
b-trees do not have this behavior, so depending on the data being indexed, there may be some performance difference between the two.
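The signature scheme described in the quoted passage can be sketched in a few lines. This is a toy illustration of the idea only, not PostgreSQL's actual implementation: the hash function, the default of 64 bits, and all names are assumptions chosen for the example.

```python
import hashlib


def word_bit(word, n_bits):
    """Map a word to one bit position (deterministic, unlike built-in hash())."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def signature(words, n_bits=64):
    """OR together one bit per word to get an n-bit document signature."""
    sig = 0
    for word in words:
        sig |= 1 << word_bit(word, n_bits)
    return sig


def may_match(doc_sig, query_words, n_bits=64):
    """Index test: True if every query bit is set. Can be a false match."""
    query_sig = signature(query_words, n_bits)
    return doc_sig & query_sig == query_sig


def search(docs, query_words, n_bits=64):
    """Filter by signature first, then recheck the real document text."""
    candidates = [d for d in docs
                  if may_match(signature(d.split(), n_bits), query_words, n_bits)]
    return [d for d in candidates
            if all(w in d.split() for w in query_words)]


if __name__ == "__main__":
    docs = ["apple banana", "cherry pie"]
    # With a 1-bit signature every word collides, so everything "matches"...
    print(may_match(signature(docs[0].split(), 1), ["cherry"], 1))
    # ...and only the recheck against the real row gives the right answer.
    print(search(docs, ["cherry"], n_bits=1))
```

The signature test never misses a real match, it only lets extra candidates through, which is why the recheck step is enough to make results correct.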
See for text search behavior http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html and http://www.postgresql.org/docs/8.3/static/indexes-types.html for a general purpose comparison.
From what information I could find, they both solve the same problems - more esoteric operations like array containment and intersection (&&, @>, <@, etc). However, I would be interested in advice about when to use one or the other (or possibly neither).
The PostgreSQL documentation has some information about this:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are about ten times slower to update than GiST
GIN indexes are two-to-three times larger than GiST
However, I would be particularly interested to know if there is a performance impact when the memory-to-index-size ratio starts getting small (i.e. the index size becomes much bigger than the available memory). I've been told on the #postgresql IRC channel that GIN needs to keep the whole index in memory, otherwise it won't be effective, because, unlike B-Tree, it doesn't know which part to read in from disk for a particular query. The question would be: is this true (because I've also been told the opposite of this)? Does GiST have the same restrictions? Are there other restrictions I should be aware of while using one of these indexing algorithms?
First of all, do you need to use them for text search indexing? GIN and GiST are indexes specialized for some data types. If you need to index simple char or integer values then the normal B-Tree index is the best.
Anyway, PostgreSQL documentation has a chapter on GIST and one on GIN, where you can find more info.
And, last but not least, the best way to find out which is better is to generate sample data (as much as you need to approximate a real scenario) and then create a GiST index, measuring how much time is needed to create the index, insert a new value, and execute a sample query. Then drop the index and do the same with a GIN index. Compare the values and you will have the answer you need, based on your data.
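The procedure above can be organised as a small timing harness. This is a sketch only: the steps are passed in as plain callables, so `make_steps` and the step names are hypothetical and would need to be filled in with real calls through whatever database driver you use (CREATE INDEX, INSERT, the sample query, DROP INDEX).

```python
import time


def time_call(fn, repeat=1):
    """Average wall-clock seconds per call of fn over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


def benchmark(index_kinds, make_steps, repeat=1):
    """Run the same named steps once per index kind and collect timings.

    make_steps(kind) must return a dict mapping a step name (for example
    "build", "insert", "query", "drop") to a zero-argument callable;
    steps run in the dict's insertion order.
    """
    results = {}
    for kind in index_kinds:
        steps = make_steps(kind)
        results[kind] = {name: time_call(fn, repeat)
                         for name, fn in steps.items()}
    return results


if __name__ == "__main__":
    # Placeholder steps so the sketch runs without a database.
    def make_steps(kind):
        return {
            "build": lambda: sum(range(10_000)),
            "insert": lambda: sum(range(1_000)),
            "query": lambda: sum(range(100)),
        }

    for kind, timings in benchmark(["gist", "gin"], make_steps).items():
        print(kind, {name: f"{t:.6f}s" for name, t in timings.items()})
```

Keeping the steps identical for both index kinds, and running them on data shaped like your own, is what makes the comparison meaningful.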