Can Postgres table partitioning by hash boost query performance? - postgresql

I have an OLTP table on a Postgres 14.2 database that looks something like this:
Column | Type | Nullable |
----------------+-----------------------------+-----------
id | character varying(32) | not null |
user_id | character varying(255) | not null |
slug | character varying(255) | not null |
created_at | timestamp without time zone | not null |
updated_at | timestamp without time zone | not null |
Indexes:
"playground_pkey" PRIMARY KEY, btree (id)
"playground_user_id_idx" btree (user_id)
The database host has 8GB of RAM and 2 CPUs.
I have roughly 500M records in the table which adds up to about 80GB in size.
The table gets about 10K INSERT/h, 30K SELECT/h, and 5K DELETE/h.
The main query run against the table is:
SELECT * FROM playground WHERE user_id = '12345678' and slug = 'some-slug' limit 1;
Users have anywhere between 1 record to a few hundred records.
Thanks to the index on the user_id I generally get decent performance (double-digit milliseconds), but 5%-10% of the queries will take a few hundred milliseconds to maybe a second or two at worst.
My question is this: would partitioning the table by hash(user_id) help me boost lookup performance by taking advantage of partition pruning?

No, that wouldn't improve the speed of the query at all, since there is an index on that attribute. If anything, the increased planning time will slow down the query.
If you want to speed up that query as much as possible, create an index that supports both conditions:
CREATE INDEX ON playground (user_id, slug);
If slug is large, it may be preferable to index a hash:
CREATE INDEX ON playground (user_id, hashtext(slug));
and query like this:
SELECT *
FROM playground
WHERE user_id = '12345678'
AND slug = 'some-slug'
AND hashtext(slug) = hashtext('some-slug')
LIMIT 1;
Of course, partitioning could be a good idea for other reasons, for example to speed up autovacuum or CREATE INDEX.
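For illustration, a hash-partitioned version of the table could be declared like this (a sketch, not a recommendation; the partition count of 8 is an arbitrary assumption, and note that in declarative partitioning the partition key must be part of the primary key):

```sql
-- Sketch: declarative hash partitioning on user_id (PostgreSQL 11+).
-- The choice of 8 partitions is arbitrary, for illustration only.
CREATE TABLE playground (
    id         character varying(32)  NOT NULL,
    user_id    character varying(255) NOT NULL,
    slug       character varying(255) NOT NULL,
    created_at timestamp              NOT NULL,
    updated_at timestamp              NOT NULL,
    PRIMARY KEY (user_id, id)  -- the partition key must be included in the PK
) PARTITION BY HASH (user_id);

CREATE TABLE playground_p0 PARTITION OF playground
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ... repeat for remainders 1 through 7
```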

Related

Sometimes a PostgreSQL query takes too much time to load data

I am using PostgreSQL 13.4 and have intermediate-level experience with PostgreSQL.
I have a table which stores readings from a device for each customer. For each customer we receive around 3,000 readings a day and store them in our usage_data table as below.
We have the proper indexing in place.
Column | Data Type | Index name | Idx Access Type
-------------+-----------------------------+---------------------------+---------------------------
id | bigint | |
customer_id | bigint | idx_customer_id | btree
date | timestamp | idx_date | btree
usage | bigint | idx_usage | btree
amount | bigint | idx_amount | btree
I also have a composite index on two columns, shown below.
CREATE INDEX idx_cust_date
ON public.usage_data USING btree
(customer_id ASC NULLS LAST, date ASC NULLS LAST)
;
Problem
I have observed a few weird incidents.
When I tried to get data for 06/02/2022 for a customer, it took almost 20 seconds. It's a simple query:
SELECT * FROM usage_data WHERE customer_id =1 AND date = '2022-02-06'
The execution plan
When I execute same query for 15 days then I receive result in 32 milliseconds.
SELECT * FROM usage_data WHERE customer_id =1 AND date > '2022-05-15' AND date <= '2022-05-30'
The execution plan
Tried Solution
I thought it might be an indexing issue, since I am facing this problem only for this particular date.
Hence I dropped all the indexes from the table and recreated them.
But the problem wasn't resolved.
Solution Worked (Not Recommended)
To solve this I tried another way: I created a new database and restored the old one into it.
I executed the same query in the new database, and this time it took 10 milliseconds.
The execution plan
I don't think this is a proper solution for a production server.
Any idea why we are facing this issue for this specific data?
Please let me know if any additional information is required.
Please guide. Thanks.
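A useful first diagnostic for cases like this (a suggestion, not part of the original question) is to compare the fast and slow variants with buffer statistics, which shows whether the slow query is wading through bloated index or heap pages:

```sql
-- Sketch: compare actual buffer usage of the slow and fast queries.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM usage_data
WHERE customer_id = 1 AND date = '2022-02-06';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM usage_data
WHERE customer_id = 1 AND date > '2022-05-15' AND date <= '2022-05-30';
```

A much larger "Buffers: shared hit/read" count for the single-day query than for the 15-day range would point at bloat, which would also explain why a fresh restore made the query fast again.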

optimize a postgres query that updates a big table [duplicate]

I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------|-----------------------------|----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update one table from the other:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before I executed this SQL command, I had already created indexes for these two tables:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I used the EXPLAIN command to see the query plan, but it didn't use the indexes I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem?
How can I shorten the execution time or tune my database/tables?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a large number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and consequently speed up sequential scans.
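The advice above could be combined into a session script roughly like the following (a sketch only; the work_mem value is an arbitrary assumption to adapt to the actual server, and the column names follow the table listings in the question):

```sql
-- Sketch: tuned bulk UPDATE following the steps above.
BEGIN;
DROP INDEX tx_input1_new_h;     -- indexes only slow down the UPDATE

SET LOCAL work_mem = '2GB';     -- example value: more memory for the hash join

ANALYZE tx_input1_new;          -- make sure statistics are accurate
ANALYZE tx_output1_new;

UPDATE tx_input1_new AS i
SET input_addr = o.output_addr,
    input_val  = o.input_val    -- column is listed as input_val in tx_output1_new
FROM tx_output1_new AS o
WHERE i.input_tx_hash  = o.tx_hash
  AND i.input_tx_index = o.output_index;
COMMIT;

-- Recreate the index after the UPDATE is done.
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
```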

Postgres returns records in wrong order

I have a pretty big table, messages. It contains about 100 million records.
When I run a simple query:
select uid from messages order by uid asc limit 1000
The result is very strange. Records at the beginning are fine, but further down they are not always ordered by the uid column.
uid
----------
94621458
94637590
94653611
96545014
96553145
96581957
96590621
102907437
.....
446131576
459475933
424507749
424507166
459474125
431059132
440517049
446131301
475651666
413687676
.....
Here is the EXPLAIN ANALYZE output:
Limit (cost=0.00..3740.51 rows=1000 width=4) (actual time=0.009..4.630 rows=1000 loops=1)
Output: uid
-> Index Scan using messages_pkey on public.messages (cost=0.00..376250123.91 rows=100587944 width=4) (actual time=0.009..4.150 rows=1000 loops=1)
Output: uid
Total runtime: 4.796 ms
PostgreSQL 9.1.12
The table is always under high load (inserts, updates, deletes) and is almost constantly being autovacuumed. Could that be causing the problem?
UPD. Added the table definition. Sorry, I cannot add the full table definition, but all important fields and their types are here.
# \d+ messages
Table "public.messages"
Column | Type | Modifiers | Storage | Description
--------------+-----------------------------+--------------------------------------------------------+----------+-------------
uid | integer | not null default nextval('messages_uid_seq'::regclass) | plain |
code | character(22) | not null | extended |
created | timestamp without time zone | not null | plain |
accountid | integer | not null | plain |
status | character varying(20) | not null | extended |
hash | character(3) | not null | extended |
xxxxxxxx | timestamp without time zone | not null | plain |
xxxxxxxx | integer | | plain |
xxxxxxxx | character varying(255) | | extended |
xxxxxxxx | text | not null | extended |
xxxxxxxx | character varying(250) | not null | extended |
xxxxxxxx | text | | extended |
xxxxxxxx | text | not null | extended |
Indexes:
"messages_pkey" PRIMARY KEY, btree (uid)
"messages_unique_code" UNIQUE CONSTRAINT, btree (code)
"messages_accountid_created_idx" btree (accountid, created)
"messages_accountid_hash_idx" btree (accountid, hash)
"messages_accountid_status_idx" btree (accountid, status)
Has OIDs: no
Here's a very general answer:
Try :
SET enable_indexscan TO off;
Rerun the query in the same session.
If the order of the results is different than with enable_indexscan set to on, then the index is corrupted.
In this case, fix it with:
REINDEX INDEX index_name;
Save yourself the long wait trying to run the query without index. The problem is most probably due to a corrupted index. Repair it right away and see if that fixes the problem.
Since your table is always under high load, consider building a new index concurrently. Takes a bit longer, but does not block concurrent writes. Per documentation on REINDEX:
To build the index without interfering with production you should drop
the index and reissue the CREATE INDEX CONCURRENTLY command.
And under CREATE INDEX:
CONCURRENTLY
When this option is used, PostgreSQL will build the index without
taking any locks that prevent concurrent inserts, updates, or deletes
on the table; whereas a standard index build locks out writes (but not
reads) on the table until it's done. There are several caveats to be
aware of when using this option — see Building Indexes Concurrently.
So I suggest:
ALTER TABLE messages DROP CONSTRAINT messages_pkey;
CREATE UNIQUE INDEX CONCURRENTLY messages_pkey ON messages (uid);
ALTER TABLE messages ADD PRIMARY KEY USING INDEX messages_pkey;
The last step is just a tiny update to the system catalogs.

PostgreSQL+table partitioning: inefficient max() and min()

I have a huge partitioned table stored in a PostgreSQL database. Each child table has an index and a check constraint on its id, e.g. (irrelevant details removed for clarity):
Master table: points
Column | Type | Modifiers
---------------+-----------------------------+------------------------
id | bigint |
creation_time | timestamp without time zone |
the_geom | geometry |
Sub-table points_01
Column | Type | Modifiers
---------------+-----------------------------+-------------------------
id | bigint |
creation_time | timestamp without time zone |
the_geom | geometry |
Indexes:
"points_01_pkey" PRIMARY KEY, btree (id)
"points_01_creation_time_idx" btree (creation_time)
"points_01_the_geom_idx" gist (the_geom) CLUSTER
Check constraints:
"enforce_srid_the_geom" CHECK (srid(the_geom) = 4326)
"id_gps_points_2010_08_22__14_47_04_check"
CHECK (id >= 1000000::bigint AND id <= 2000000::bigint)
Now,
SELECT max(id) FROM points_01
is instant, but:
SELECT max(id) FROM points
which is a master table for points_01 .. points_60 and should take very little time using check constraints, takes more than an hour because the query planner does not utilize the check constraints.
According to the PostgreSQL wiki (last section of this page), this is a known issue that would be fixed in the next versions.
Is there a good hack that will make the query planner utilize the check constraints and indices of sub-tables for max() and min() queries?
Thanks,
Adam
I don't know if it will work, but you could try this:
For that session, you could disable all access strategies but indexed ones:
db=> set enable_seqscan = off;
db=> set enable_tidscan = off;
db=> -- your query goes here
This way, only bitmapscan and indexscan would be enabled. PostgreSQL will have no choice but to use indexes to access data on the table.
After running your query, remember to reenable seqscan and tidscan by doing:
db=> set enable_seqscan = on;
db=> set enable_tidscan = on;
Otherwise, those access strategies will be disabled for the session from that point on.
Short answer: no. At this point in time, there's no way to make the Postgres planner understand that some aggregate functions can check the constraints on child partitions first. It's fairly easy to prove for the specific case of min and max, but for aggregates in general, it's a tough case.
You can always write it as a UNION of several partitions when it just has to be done...
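For min()/max() specifically, that workaround can be written so that each branch does a fast indexed lookup against a single partition (a sketch, assuming child tables points_01 .. points_60 as described above; only two branches are shown):

```sql
-- Sketch: take the max over per-partition maxima; each inner max(id)
-- can be answered from the child table's primary-key index.
SELECT max(id) FROM (
    SELECT max(id) AS id FROM points_01
    UNION ALL
    SELECT max(id) AS id FROM points_02
    -- ... one branch per child table, up to points_60
) AS per_partition;
```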
I don't know much about Postgres, but you could try this query (my syntax may not be right due to lack of experience with Postgres queries):
SELECT id FROM points a WHERE id > ALL (SELECT id FROM points x WHERE x.id != a.id)
I'm curious if this works.

Oracle: TABLE ACCESS FULL with Primary key?

There is a table:
CREATE TABLE temp
(
IDR decimal(9) NOT NULL,
IDS decimal(9) NOT NULL,
DT date NOT NULL,
VAL decimal(10) NOT NULL,
AFFID decimal(9),
CONSTRAINT PKtemp PRIMARY KEY (IDR,IDS,DT)
)
;
Let's see the plan for select star query:
SQL>explain plan for select * from temp;
Explained.
SQL> select plan_table_output from table(dbms_xplan.display('plan_table',null,'serial'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
---------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 61 | 2 (0)|
| 1 | TABLE ACCESS FULL| TEMP | 1 | 61 | 2 (0)|
---------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
11 rows selected.
SQL Server 2008 shows a Clustered Index Scan in the same situation. What is the reason?
select * with no where clause means: read every row in the table and fetch every column.
What do you gain by using an index? You have to go to the index, get a rowid, translate the rowid into a table offset, and read the row.
What happens when you do a full table scan? You go to the first rowid in the table, then read on through to the end.
Which one of these is faster given the table above? The full table scan. Why? Because it skips having to go to the index, retrieve values, and then go back to where the table data lives to fetch the rows.
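Conversely, once the query is selective, the optimizer does use the primary-key index. A quick way to see the difference (a sketch against the temp table above; the literal values are arbitrary):

```sql
-- Sketch: a selective lookup on all primary-key columns typically
-- shows INDEX UNIQUE SCAN instead of TABLE ACCESS FULL.
EXPLAIN PLAN FOR
SELECT * FROM temp
WHERE IDR = 1 AND IDS = 2 AND DT = DATE '2020-01-01';

SELECT plan_table_output FROM table(dbms_xplan.display());
```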
To answer this more simply without mumbo-jumbo, the reason is:
Clustered Index = Table
That's by definition in SQL Server. If this is not clear, look up the definition.
To be absolutely clear once again, since most people seem to miss this: the clustered index IS the table itself. It therefore follows that a "Clustered Index Scan" is another way of saying "Table Scan", or what Oracle calls "TABLE ACCESS FULL".