Postgres clustering using multi-column indexes

I have a table which includes a multi-column index defined as
CREATE INDEX tab_a_idx1 ON tab_a USING btree (device, fixtime)
The index was chosen deliberately because the majority of the queries run against this table include selection criteria like this
WHERE device = 'xyz' AND fixtime > 'sometime' AND fixtime <= 'someothertime' ORDER BY fixtime;
The table has been clustered on this index in an effort to improve performance.
CLUSTER tab_a USING tab_a_idx1;
Based on the comments and answers in a previous question, I've used this query to list my clustered tables, the indexes they're clustered on, and the definitions of those indexes:
SELECT c.oid, c.relname as tablename, x.relname as indexname, z.indexdef
FROM pg_class c
JOIN pg_index i ON i.indrelid = c.oid
JOIN pg_class x ON i.indexrelid = x.oid
JOIN pg_indexes z ON x.relname = z.indexname
WHERE c.relkind = 'r' AND c.relhasindex AND i.indisclustered
And I've been using the pg_stats view to check the correlation of the indexed columns.
The quoted answer states that a correlation close to 1 is good, and that the lower the value gets, the more clustering is indicated.
Immediately after the table was clustered, the correlation of the first column in the index (device) was low (0.008) and that of the second one (fixtime) was relatively high (0.994).
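A minimal sketch of that pg_stats check (assuming one of the partitions from the update below, tab_a_201704):
-- correlation per indexed column for one partition
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'tab_a_201704'
  AND attname IN ('device', 'fixtime');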
If these values are supposed to be close to '1' but aren't, does that mean that a table can't (or shouldn't) be clustered on a multi-column index?
There are several versions of tab_a (it's partitioned on fixtime), and I've noticed that the correlation values don't actually seem to vary much between the clustered and un-clustered versions of the table. Does this mean there's no point in clustering on this index?
Thanks
UPDATE - the parent table was created as follows....
CREATE TABLE tab_a
( device  CHAR(6),
  fixTime TIMESTAMP,
  -- ...lots more fields...
)
PARTITION BY RANGE (fixTime);
The individual partitions were created like this
CREATE TABLE tab_a_201704 PARTITION OF tab_a FOR VALUES FROM ('2017-04-01') TO ('2017-05-01');
And the index used for the clustering like this....
CREATE INDEX tab_a_201704_idx2 ON tab_a_201704 (device, fixTime);
And the command to do the cluster....
CLUSTER tab_a_201704 USING tab_a_201704_idx2;

Related

How to get the value list of a list-partitioned table in PostgreSQL?

I am trying to use list partitioning in PostgreSQL.
https://www.postgresql.org/docs/current/ddl-partitioning.html
So, I have some questions about that.
Is there a limit on the number of values or partition tables in list partitioning?
When a partitioned table is created as shown below, can I check the value list with SQL? (like keys = [test, test_2])
CREATE TABLE part_table (id int, branch text, key_name text) PARTITION BY LIST (key_name);
CREATE TABLE part_default PARTITION OF part_table DEFAULT;
CREATE TABLE part_test PARTITION OF part_table FOR VALUES IN ('test');
CREATE TABLE part_test_2 PARTITION OF part_table FOR VALUES IN ('test_2');
When using the partitioned table created above, a row added with key_name = 'test_3' goes into the default partition. If 'test_3' rows already exist in the default partition and I then try to create a partition for that value, the following error occurs:
CREATE TABLE part_test_3 PARTITION OF part_table FOR VALUES IN ('test_3');
ERROR: updated partition constraint for default partition "part_default" would be violated by some row
In this case, is there a good way to partition on the value 'test_3' without deleting the rows in the default partition?
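For what it's worth, a commonly used pattern here (a sketch, not from the answer below) is to move the conflicting rows out of the default partition and create the new partition inside a single transaction; the rows do leave the default partition, but the whole operation is atomic:
BEGIN;
-- stash the conflicting rows (the temp table name is illustrative)
CREATE TEMP TABLE tmp_test_3 AS
    SELECT * FROM part_default WHERE key_name = 'test_3';
DELETE FROM part_default WHERE key_name = 'test_3';
-- the default partition no longer violates the new partition constraint
CREATE TABLE part_test_3 PARTITION OF part_table FOR VALUES IN ('test_3');
-- re-insert through the parent so the rows are routed into part_test_3
INSERT INTO part_table SELECT * FROM tmp_test_3;
COMMIT;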
Is it possible to change the name or the value list of a partition?
Thank you!
Is there a limit on the number of values or partition tables in list partitioning?
Some tests here: https://www.depesz.com/2021/01/17/are-there-limits-to-partition-counts/
To see which values are currently in the table, and which partition each value resides in:
SELECT
    tableoid::pg_catalog.regclass,
    array_agg(DISTINCT key_name)
FROM part_table
GROUP BY 1;
To get all the current partitions and their configured value ranges, use the following:
SELECT
    c.oid::pg_catalog.regclass,
    c.relkind,
    i.inhdetachpending AS is_detached,
    pg_catalog.pg_get_expr(c.relpartbound, c.oid)
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_inherits i ON c.oid = i.inhrelid
WHERE i.inhparent = '58281';

-- the following query will return 58281:
SELECT c.oid
FROM pg_catalog.pg_class c
WHERE c.relname = 'part_table';
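The separate OID lookup can be avoided with a regclass cast (the same device used in a later answer on this page); a slightly simplified version of the same query:
SELECT
    c.oid::pg_catalog.regclass AS partition_name,
    i.inhdetachpending AS is_detached,
    pg_catalog.pg_get_expr(c.relpartbound, c.oid) AS partition_bound
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_inherits i ON c.oid = i.inhrelid
WHERE i.inhparent = 'part_table'::regclass;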

Does PostgreSQL's Statistics Collector track *all* usage of indexes?

After taking over the DBA duties on a fairly complex database, I wanted to eliminate any indexes that are consuming substantial disk space, but not being used. I ran the following, to identify unused indexes, sorting to prioritize those that consume the most space on disk:
SELECT
    schemaname,
    pg_stat_all_indexes.relname AS table,
    pg_class.relname AS index,
    pg_total_relation_size(oid) AS size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_class
JOIN pg_stat_all_indexes ON pg_stat_all_indexes.indexrelname = pg_class.relname
WHERE relkind = 'i'
ORDER BY size DESC;
I was a little surprised at just how many large indexes appear not to be used at all, as evidenced by a 0 in the idx_scan column. Some of these apparently-unused indexes include a function call that does something pretty specific (as in the contrived example below), and appear to have been set up to assist with API functionality.
-- not a real index
CREATE INDEX foo_transform_foo_name_idx
ON foo USING btree
(foo_transform_name(foo_name));
My question here is whether the Statistics Collector captures all uses of a particular index, even if the index was scanned from an SQL-language function or in some other way.
These indexes have never been scanned. However, there are some other uses for indexes:
they enforce uniqueness and other constraints
they make ANALYZE gather statistics on indexed expressions
Use this query from my blog to find the indexes that you can drop without any negative consequences:
SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_relation_size(s.indexrelid) AS index_size
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0         -- has never been scanned
  AND 0 <> ALL (i.indkey)    -- no index column is an expression
  AND NOT i.indisunique      -- is not a UNIQUE index
  AND NOT EXISTS             -- does not enforce a constraint
      (SELECT 1 FROM pg_catalog.pg_constraint c
       WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;
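To turn that result into statements you can review before running, one option (a sketch reusing the same conditions) is to wrap the names in format():
-- generate DROP INDEX statements for the same candidate set
SELECT format('DROP INDEX %I.%I;', s.schemaname, s.indexrelname) AS drop_statement
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0
  AND 0 <> ALL (i.indkey)
  AND NOT i.indisunique
  AND NOT EXISTS
      (SELECT 1 FROM pg_catalog.pg_constraint c
       WHERE c.conindid = s.indexrelid);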

PostgreSQL 12.4 query planner ignores sub-partition constraint, resulting in table scan

I have a table
T (A int, B int, C bigint, D varchar)
partitioned by each A and sub-partitioned by each B (i.e. list partitions with a single value each). A has a cardinality of <10 and B has a cardinality of <100. T has about 6 billion rows.
When I run the query
select distinct B from T where A = 1;
it prunes the top-level partitions (those where A != 1) but performs a table scan on all sub-partitions to find distinct values of B. I thought it would know, based on the partition design, that it would only have to check the partition constraint to determine the possible values of B given A, but alas, that is not the case.
There are no indexes on A or B, but there is a primary key on (C, D) at each partition, which seems immaterial, but I figured I should mention it. I also have a BRIN index on C. Any idea why the Postgres query planner is not consulting the sub-partition constraints to avoid the table scan?
The reason is that nobody implemented such an optimization in the query planner. I cannot say that surprises me, since it is a rather unusual query. Every such optimization built into the optimizer would mean that each query with a DISTINCT on a partitioned table would need some extra query planning time, while only a few queries would profit. Apart from the expense of writing and maintaining the code, that would be a net loss for most users.
Maybe you could use a metadata query:
CREATE TABLE list (id bigint NOT NULL, p integer NOT NULL) PARTITION BY LIST (p);
CREATE TABLE list_42 PARTITION OF list FOR VALUES IN (42);
CREATE TABLE list_101 PARTITION OF list FOR VALUES IN (101);
SELECT regexp_replace(
           pg_get_expr(p.relpartbound, p.oid),
           '^FOR VALUES IN \((.*)\)$',
           '\1'
       )::integer
FROM pg_class AS p
JOIN pg_inherits AS i ON p.oid = i.inhrelid
WHERE i.inhparent = 'list'::regclass;
regexp_replace
----------------
42
101
(2 rows)
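Applied to the two-level layout in the question, the same catalog trick could replace the table scan; a sketch, assuming the top-level partition holding A = 1 is named t_a1 (a hypothetical name). Note that it lists the sub-partition bounds, so the values for empty sub-partitions show up too:
-- candidate B values for A = 1, read from the catalog instead of the data
SELECT regexp_replace(
           pg_get_expr(p.relpartbound, p.oid),
           '^FOR VALUES IN \((.*)\)$',
           '\1'
       )::integer AS b
FROM pg_class AS p
JOIN pg_inherits AS i ON p.oid = i.inhrelid
WHERE i.inhparent = 't_a1'::regclass;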

Postgresql 9.4 slow [duplicate]

I have a table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows, and I need to query:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
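A minimal sketch of the materialized-view variant mentioned above (the view name is illustrative; this suits read-mostly data and must be refreshed after changes):
-- number the rows once, in the pagination sort order
CREATE MATERIALIZED VIEW big_table_numbered AS
SELECT row_number() OVER (ORDER BY vote, id) AS rn, *
FROM big_table;

CREATE UNIQUE INDEX ON big_table_numbered (rn);

-- jump straight to the row number instead of using OFFSET
SELECT *
FROM big_table_numbered
WHERE rn > 900000
ORDER BY rn
LIMIT 100;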
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
It can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
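If vote is nullable, a sketch of what that construct looks like in both the query and a matching index (for DESC the default is NULLS FIRST):
-- keep NULLs at the end even in descending order
CREATE INDEX vote_desc_nulls_last ON big_table (vote DESC NULLS LAST, id DESC);

SELECT *
FROM big_table
ORDER BY vote DESC NULLS LAST, id DESC
LIMIT 100;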
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
It would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
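Spelled out, that workaround might look like this (vote_x and id_x are the placeholders used above):
-- expression index matching ORDER BY vote ASC, id DESC
CREATE INDEX vote_asc_id_desc ON big_table (vote, (id * -1));

SELECT *
FROM big_table
WHERE (vote, (id * -1)) > (vote_x, (id_x * -1))  -- row values, as before
ORDER BY vote, (id * -1)
LIMIT n;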
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a reduction in blocking are common reasons to partition tables. Improving query performance is not a reason to employ partitioning, though it can be a beneficial side-effect in some cases. In terms of performance, it is important to ensure that your implementation plan includes a review of query performance. Confirm that your indexes continue to appropriately support your queries after the table is partitioned, and verify that queries using the clustered and nonclustered indexes benefit from partition elimination where applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2.
You can find the source code used in this question (simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a huge (well over 20 million rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Let's say I want to sort this table over two columns (ASC).
But I don't want to do that with a simple ORDER BY, because later I might need to insert rows into the sorted output, and the user probably only wants to see 100 rows at once (of the sorted output).
To achieve these goals I do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating a index on expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the IDs of the data_table to the meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similar simple insert.
To get rows 900000 - 900099 (100 rows) I can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what if I want to get rows 20'000'000 - 20'000'099 (100 rows)? Same plan, much longer execution time. Well, to improve the OFFSET performance (Improving OFFSET performance in PostgreSQL) I can do the following (let's assume I saved every 100'000th row away into another table):
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfect and postgres does a great job!
Let's assume I want to change the order of the second column to DESC.
But then I would have to change my WHERE clause, because the > operator compares both columns ASC. The same query as above (ASC ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the plan changes and the query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I use the efficient old plan with DESC ordering?
Do you have any better ideas how to solve the problem?
(I already tried declaring my own type with its own operator classes, but that's too slow.)
You need to rethink your approach. Where to begin? This is a clear example of the performance limits of the functional approach you are taking to SQL. Functions are largely opaque to the planner, and you are forcing two different lookups on data_table for every row retrieved, because the stored procedures' plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates), but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, the index will return wrong results.
In these cases, you are almost always better off writing the join explicitly and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more disk-I/O-intensive) when you change the order of your second column. On the other hand, if you had two different indexes on data_table and an explicit join, PostgreSQL could handle this much more easily.
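A minimal sketch of that explicit-join direction, under the answer's assumptions (the indexes live on data_table itself, one per sort order, and the names are illustrative):
-- one index per required sort order, on the real columns
CREATE INDEX data_sort_asc   ON data_table (column_a, column_b, id);
CREATE INDEX data_sort_mixed ON data_table (column_a, column_b DESC, id);

-- explicit join: the planner sees the real columns and can pick the right index
SELECT d.column_a, d.column_b, d.id
FROM meta_table m
JOIN data_table d ON d.id = m.id_data
ORDER BY d.column_a, d.column_b DESC, d.id
LIMIT 100;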