How fragmentation happens in index? - postgresql

I had a table of 200GB with an index of 49GB. Only insert and update operation happens to that table. I dropped the existing index and created a new one on the same columns. New index size is only 6GB. I am using postgres database
Can someone explain how index size got reduced from 50GB to 6GB?

The newly created index is essentially optimally packed sorted data. To put some more data somewhere in the middle, while still maintaining the optimal packed sorted data you'd have to rewrite half of the index with every insert on average.
This is not acceptable, so the database uses some complicated and clever format for indexes (based on a b-tree data structure) that allow for changing the order of index blocks without moving them on disk. But the consequence of this is that after inserting some data in the middle some of the index data blocks are not 100% packed. The space left can be used in the future but only if the values inserted match to the block with regards of ordering.
So, depending on your usage pattern, you can easily have index blocks only 10% packed on average.
This is compounded by the fact that when you update a row both old and new version have to be present in the index at the same time. And if you do a bulk update of the whole table then the index will have to expand to contain twice the number of rows, although briefly. But it will not shrink back as easily, as this requires basically a rewrite of it all.
The index size tend to grow first and then stabilize after some usage. But the stable size is often nowhere near the size of a newly created one.
You might want to tune the autovacuum to be more aggressive - so the not needed anymore space in table and indexes is recovered faster and therefore can be reused faster. This can make your index stabilize faster and smaller. Also try to avoid too big bulk updates or do a vacuum full tablename after a huge update.

Related

How do I figure when an index was last used?

In Postgres I can run a query against pg_stat_user_indexes table to verify whether an index was ever scanned or not. I have quite a few indexes that have 0 scans and have a few with less than 5 scans. I am considering a possibility of removing those indexes but I want to know when they were last used. Is there a way to find out when the last index scan happened for an index?
No, you cannot find out when an index was last used. I recommend that you take the usage count now and again a month from now. Then see if the usage count has increased.
Don't hesitate to drop indexes that are rarely used, unless you are dealing with a data warehouse. Even if the occasional query can take longer, all data modifications on the table will become faster, which is a net win.

How does PostgreSQL perform ORDER BY with a b-tree index on the field?

I have a table bsort:
CREATE TABLE bsort(a int, data text);
Here data may be incomplete. In other words, some tuples may not have data value.
And then I build a b-tree index on the table:
CREATE INDEX ON bsort USING BTREE(a);
Now if I perform this query:
SELECT * FROM bsort ORDER BY a;
Does PostgreSQL sort tuples with nlogn complexity, or does it get the order directly from the b-tree index?
For a simple query like this Postgres will use an index scan and retrieve readily sorted tuples from the index in order. Due to its MVCC model Postgres had to always visit the "heap" (data pages) additionally to verify entries are actually visible to the current transaction. Quoting the Postgres Wiki on index-only scans:
PostgreSQL indexes do not contain visibility information. That is, it
is not directly possible to ascertain if any given tuple is visible to
the current transaction, which is why it has taken so long for index-only
scans to be implemented.
Which finally happened in version 9.2: index-only scans. The manual:
If the index stores the original indexed data values (and not some
lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data not just the TID of
the heap tuple. This will only avoid I/O if the visibility map shows
that the TID is on an all-visible page; else the heap tuple must be
visited anyway to check MVCC visibility. But that is no concern of the
access method's.
The visibility map decides whether index-only scans are possible. Only an option if all involved column values are included in the index. Else, the heap has to be visited (additionally) in any case. The sort step is still not needed.
That's why we sometimes append otherwise useless columns to indexes now. Like the data column in your example:
CREATE INDEX ON bsort (a, data); -- btree is the default index type
It makes the index bigger (depends) and a bit more expensive to maintain and use for other purposes. So only append the data column if you get index-only scans out of it. The order of columns in the index is important:
Working of indexes in PostgreSQL
Is a composite index also good for queries on the first field?
Since Postgres 11, there are also "covering indexes" with the INCLUDE keyword. Like:
CREATE INDEX ON bsort (a) INCLUDE (data);
See:
Does a query with a primary key and foreign keys run faster than a query with just primary keys?
The benefit of an index-only scan, per documentation:
If it's known that all tuples on the page are visible, the heap fetch
can be skipped. This is most noticeable on large data sets where the
visibility map can prevent disk accesses. The visibility map is vastly
smaller than the heap, so it can easily be cached even when the heap
is very large.
The visibility map is maintained by VACUUM which happens automatically if you have autovacuum running (the default setting in modern Postgres). Details:
Are regular VACUUM ANALYZE still recommended under 9.1?
But there is some delay between write operations to the table and the next VACUUM run. The gist of it:
Read-only tables stay ready for index-only scans once vacuumed.
Data pages that have been modified lose their "all-visible" flag in the visibility map until the next VACUUM (and all older tansactions being finished), so it depends on the ratio between write operations and VACUUM frequency.
Partial index-only scans are still possible if some of the involved pages are marked all-visible. But if the heap has to be visited anyway, the access method "index scan" is a bit cheaper. So if too many pages are currently dirty, Postgres will switch to the cheaper index scan altogether. The Postgres Wiki again:
As the number of heap fetches (or "visits") that are projected to be
needed by the planner goes up, the planner will eventually conclude
that an index-only scan isn't desirable, as it isn't the cheapest
possible plan according to its cost model. The value of index-only
scans lies wholly in their potential to allow us to elide heap access
(if only partially) and minimise I/O.
You would need to check the execution plan. But, Postgres is quite capable of using the index to make the order by more efficient. It would read the records directly from the index. Because you have only one column, there is no need to access the data pages.

updating column with Index in postgres

I am updating over ~8M rows in table. The column I am updating is client_id and table has composite index on user_id and client_id. Is it going to affect the indexing in some way ...?
Doing a large update will be slower with indexes, since the indexes will have to be updated also. It might be an issue, or not. Depends on many things.
After the update the indexes will be as they should, but REINDEX might be in order to make their space usage better. If this 8M rows is majority of the table, VACUUM FULL may be in order to reduce disk space usage, but if the table is heavily updated all the time, it might not be worth it.
So if you want, you can remove the index, update and recreate the index, but it is impossible to say if that would be faster than doing update with the index in place.

Why and when is necessary to rebuild indexes in MongoDB?

Been working with MongoDB for a while and today I had a doubt while discussing with a colleague.
The thing is that when you create an index in MongoDB, the collection is processed and the index is built.
The index is updated within insertion and deletion of documents so I don't really see the need to run a rebuild index operation (which drops the index and then rebuild it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most
users, the reIndex command is unnecessary. However, it may be worth
running if the collection size has changed significantly or if the
indexes are consuming a disproportionate amount of disk space.
Does someone has had the need of running a rebuild index operation that worth it?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.
You want to migrate from an older index format to a newer one. If a reindex is advisible this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection to or from a monotonically increasing key (eg. ObjectID) to a random value. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
For more information on general B-tree behaviour, see: Wikipedia: B-tree
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
While I don't know the exact technical reasons why, in MongoDB, I can make some assumptions about this, based on what I know about indexing from other systems and based on the documentation that you quoted.
The General Idea Of An Index
When moving from one document to the next, in the full document collection, there is a lot of wasted time and effort skipping past all the data that doesn't need to be dealt with. If you're looking for document with id "1234", having to move through 100K+ of each document makes it slow
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the id's in the index, find the locations of the documents that it needs, and go load them directly.
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most database also allocate disk space in large chunks, as needed.
Instead of growing 100K of file size when 100K of documents are added, MongoDB will probably grow 1MB or maybe 10MB or something - I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks give the ability to 'grow' the documents in to the space faster because the database doesn't need to constantly expand. If the database now has 10MB of space already allocated, it can just use that space up. It doesn't have to keep expanding the file for each document. It just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. index keys may not be in order because there was room in the middle of the index file and not at the end, when the index needed to be built. Index keys may have a lot of space in between them, as well.
If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.
If the index is constantly being thrashed - lots of stuff removed and added - it's probably faster to just grow the index file size and always put stuff at the end. this is fast to create the index, but leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out of place items to temp location, free up space in right spot, move them back. Oh by the way, to free up space, you had to move other items to temp location. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Giving everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file, to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation, means 99.9% or more of mongodb collections don't need to worry about this.
MongoDB specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
So if you delete documents from a collection you are wasting disk space unless you rebuild the index for that collection.

Reorganizing indexes and database size

I have a fragmentation problem on my production database. One of my main data tables is about 6GB(3GB Indexes) (about 9M records) in size and has 94%(!) index fragmentation.
I know that reorganizing indexes will solve this problem BUT my database is on SQL Server 2008R2 Express which has 10GB database limit and my database is already 8GB in size.
I have read few blog posts about this issue but non gave answer to my situation.
My Question1 is:
How much size(% or in GB) increase can I expect after reorganizing indexes on that table?
Question2:
Will Drop Index -> Build same index take less space? Time is not a factor for me at the moment.
Extra question:
Any other suggestions for database fragmentation? I know only to avoid shrinking like a fire ;)
Having INDEX on key columns will improve joins and Filters by negating the need for a table scan. A well maintained index can drastically improve performance.
It is Right that GUID's makes poor choice for indexed columns but by no means does it mean that you should not create these indexes. Ideally a data type of INT or BIGINT would be advisable.
For me Adding NEWID() as a default has shown some improvement in counteracting index fragmentation but if all alternatives fail you may have to do index maintenance (Rebuild, reorganize) operations more often than for other indexes. Reorganize needs some working space but in your scenario as time is not a concern, I would disable index, shrink DB and create index.