We have a SELECT operation that involves tables with around 20 million records. There are no clustered indexes on the tables, and the non-clustered indexes are stored in another filegroup.
The query usually takes around 3 hours to complete. The execution plan uses a Table Scan, and none of the non-clustered indexes are involved.
So we tried to improve the duration by adding clustered indexes back to the tables. All indexes and statistics were rebuilt.
However, the plan now performs a Clustered Index Scan instead of a Table Scan, and the result is much worse than before (3x slower).
So my questions are:
Why is a clustered index scan worse than a table scan?
Is it a good idea to remove those clustered indexes? INSERT/UPDATE/DELETE performance is not a concern at this moment.
Thanks for your help.
I have a table with around 60M records, and potentially it will grow to ~500M soon (and then grow slowly). The table has a column, say category. The total number of categories is around 20K, and it grows very slowly and only occasionally. Records are not distributed evenly among categories: some categories cover 5% of all records, while others are represented by only a very small proportion of records.
I have a number of queries that work with only one or several categories (using = or IN/ANY conditions), and I want to optimize the performance of these queries.
Taking into account the low selectivity of the data in the column, which type of Postgres index will be more beneficial: HASH or B-TREE?
Are there any other ways to optimize performance of these queries?
I can only give a generalized answer to this broad question.
Use B-tree indexes, not hash indexes.
If you have several conditions that are not very selective, create an index on each of the columns; they can then be combined with a bitmap index scan.
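For illustration, a minimal sketch assuming a hypothetical table orders with two low-selectivity columns, category and status:

CREATE INDEX ON orders (category);
CREATE INDEX ON orders (status);
-- A query filtering on both columns can combine the two indexes:
EXPLAIN SELECT * FROM orders WHERE category = 42 AND status = 7;
-- The plan may show a BitmapAnd of two Bitmap Index Scans,
-- followed by a Bitmap Heap Scan on orders.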
In general, a column that is not very selective is not a good candidate for an index. Indexes are not free: they need to be maintained, and at query time, in most cases, Postgres will still have to go out to the table for each row the index search matches (the exception is covering indexes).
With that said, I'm not sure about your selectivity analysis. If the highest percentage you'll filter down to in the worst case is 5%, and most categories are far lower than that, then I'd say you have a very selective column.
As for which index type to use, b-tree versus hash, I generally go with a b-tree index as my standard unless there is a specific need to deviate.
Hash indexes are faster to query than b-tree indexes, but they cannot be used for range lookups, only equality. Hash indexes are not supported on all RDBMSs and, as a result, are less well understood in the community, which can hinder support.
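In Postgres syntax, the two options look like this (a sketch on a hypothetical table t; note that a hash index supports only the = operator):

CREATE INDEX t_category_btree ON t USING btree (category); -- handles =, <, >, BETWEEN, ORDER BY
CREATE INDEX t_category_hash ON t USING hash (category); -- handles = only
-- Also note: before PostgreSQL 10, hash indexes were not WAL-logged
-- and therefore not crash-safe.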
Many posts like this stackoverflow link claim that there is no concept of a clustered index in PostgreSQL. However, the PostgreSQL documentation contains something similar. A few people claim it is similar to a clustered index in SQL Server.
Do you know what the exact difference between these two is, if there is any?
A clustered index or index organized table is a data structure where all the table data are organized in index order, typically by organizing the table in a B-tree structure.
Once a table is organized like this, the order is automatically maintained by all future data modifications.
PostgreSQL does not have such clustering indexes. What the CLUSTER command does is rewrite the table in the order of the index, but the table remains a fundamentally unordered heap of data, so future data modifications will not maintain that index order.
You have to CLUSTER a PostgreSQL table regularly if you want to maintain an approximate index order in the face of data modifications to the table.
Clustering in PostgreSQL can improve performance, because tuples found during an index scan will be close together in the heap table, which can turn random access to the heap to faster sequential access.
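For example, the command looks like this (a sketch with hypothetical table and index names; note that CLUSTER takes an exclusive lock on the table while it rewrites it):

CLUSTER mytable USING mytable_category_idx;
-- Postgres remembers the chosen index, so a periodic maintenance
-- run can simply repeat:
CLUSTER mytable;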
I have a table bsort:
CREATE TABLE bsort(a int, data text);
Here data may be incomplete. In other words, some tuples may not have a data value (i.e., data is NULL).
And then I build a b-tree index on the table:
CREATE INDEX ON bsort USING BTREE(a);
Now if I perform this query:
SELECT * FROM bsort ORDER BY a;
Does PostgreSQL sort the tuples with O(n log n) complexity, or does it get the order directly from the b-tree index?
For a simple query like this, Postgres will use an index scan and retrieve readily sorted tuples from the index in order. Due to its MVCC model, Postgres always had to additionally visit the "heap" (data pages) to verify that entries are actually visible to the current transaction. Quoting the Postgres Wiki on index-only scans:
PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented.
Which finally happened in version 9.2: index-only scans. The manual:
If the index stores the original indexed data values (and not some lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data, not just the TID of the heap tuple. This will only avoid I/O if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check MVCC visibility. But that is no concern of the access method's.
The visibility map decides whether index-only scans are possible. They are only an option if all column values involved in the query are included in the index; otherwise, the heap has to be visited (additionally) in any case. The sort step is still not needed either way.
That's why we sometimes append otherwise useless columns to indexes now. Like the data column in your example:
CREATE INDEX ON bsort (a, data); -- btree is the default index type
It makes the index bigger (how much depends on the data) and a bit more expensive to maintain and to use for other purposes. So only append the data column if you get index-only scans out of it. The order of columns in the index is important:
Working of indexes in PostgreSQL
Is a composite index also good for queries on the first field?
Since Postgres 11, there are also "covering indexes" with the INCLUDE keyword. Like:
CREATE INDEX ON bsort (a) INCLUDE (data);
See:
Does a query with a primary key and foreign keys run faster than a query with just primary keys?
The benefit of an index-only scan, per documentation:
If it's known that all tuples on the page are visible, the heap fetch can be skipped. This is most noticeable on large data sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the heap, so it can easily be cached even when the heap is very large.
The visibility map is maintained by VACUUM which happens automatically if you have autovacuum running (the default setting in modern Postgres). Details:
Are regular VACUUM ANALYZE still recommended under 9.1?
But there is some delay between write operations to the table and the next VACUUM run. The gist of it:
Read-only tables stay ready for index-only scans once vacuumed.
Data pages that have been modified lose their "all-visible" flag in the visibility map until the next VACUUM (and all older transactions being finished), so it depends on the ratio between write operations and VACUUM frequency.
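If you need index-only scans on a table shortly after heavy writes, a manual VACUUM refreshes the visibility map without waiting for autovacuum (a sketch using the bsort table from the question):

VACUUM ANALYZE bsort; -- sets the all-visible flag for qualifying pages (and updates planner statistics)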
Partial index-only scans are still possible if some of the involved pages are marked all-visible. But if the heap has to be visited anyway, the access method "index scan" is a bit cheaper. So if too many pages are currently dirty, Postgres will switch to the cheaper index scan altogether. The Postgres Wiki again:
As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.
You would need to check the execution plan. But Postgres is quite capable of using the index to make the ORDER BY more efficient: it can read the rows in index order and skip the sort step entirely. And if the query referenced only the indexed column, an index-only scan could avoid accessing the data pages as well.
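To check, you could run something like this (illustrative; the actual plan depends on table size, statistics, and the state of the visibility map):

EXPLAIN (ANALYZE) SELECT * FROM bsort ORDER BY a;
-- Look for an "Index Scan using ..." node with no separate Sort node.
-- With an index on (a, data) and a recently vacuumed table,
-- you may see an "Index Only Scan" instead.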
I have a fragmentation problem on my production database. One of my main data tables is about 6 GB in size (3 GB of which are indexes; about 9M records) and has 94%(!) index fragmentation.
I know that reorganizing the indexes would solve this problem, BUT my database is on SQL Server 2008 R2 Express, which has a 10 GB database limit, and my database is already 8 GB in size.
I have read a few blog posts about this issue, but none gave an answer for my situation.
Question 1:
How much of a size increase (in % or GB) can I expect after reorganizing the indexes on that table?
Question 2:
Will dropping the index and rebuilding the same index take less space? Time is not a factor for me at the moment.
Extra question:
Any other suggestions for database fragmentation? I only know that shrinking should be avoided like fire ;)
Having an index on key columns will improve joins and filters by removing the need for a table scan. A well-maintained index can drastically improve performance.
It is true that GUIDs make a poor choice for indexed columns, but by no means does that mean you should not create these indexes. Ideally, a data type of INT or BIGINT would be advisable.
For me, adding NEWSEQUENTIALID() as the default has shown some improvement in counteracting index fragmentation, but if all alternatives fail you may have to do index maintenance (rebuild, reorganize) operations more often than for other indexes. Reorganizing needs some working space, but in your scenario, as time is not a concern, I would disable the index, shrink the DB, and rebuild the index.
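That sequence could look like this in T-SQL (a sketch with hypothetical database, table, and index names; note that disabling a clustered index makes the table inaccessible until it is rebuilt, so this applies to non-clustered indexes):

ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable DISABLE; -- frees the index's pages
DBCC SHRINKDATABASE (MyDatabase); -- reclaim the freed space
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable REBUILD; -- re-create the index from scratch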
I have a very large table with two indexes on it, but no PK (clustered) index.
Would the performance of the two indexes increase if there was also a clustered index on the table, even if I have to "contrive" one from an identity column?
A well-chosen clustered index can work miracles for your performance. Why? The clustered index defines how your data is physically stored on disk. Choosing a good clustered index will ensure you get sequential I/O instead of random I/O, which is a great performance gain, because the bottleneck in most database setups is the hard drives and their I/O.
Try to create your clustered index on a value that is used a lot in joins.
If you just put it on your primary key, your performance will still improve, as the non-clustered indexes will use the clustered index for their seek operations, which avoids table scans.
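For instance, a minimal sketch assuming a hypothetical table dbo.BigTable with an identity column Id:

-- An ever-increasing identity column gives a narrow, sequential clustering key
-- that the non-clustered indexes use as their row locator.
CREATE CLUSTERED INDEX CX_BigTable_Id ON dbo.BigTable (Id);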
Hope this answers your question
It's more like the opposite: non-clustered indexes suffer from clustered indexes:
http://use-the-index-luke.com/blog/2014-01/unreasonable-defaults-primary-key-clustering-key
However, if you manage to replace one of your non-clustered indexes with a clustered index, overall performance might increase... or decrease. It really depends on your workload.