We are running PostgresSql 9.6.11 database on Amazon RDS. The execution time of one of the queries is 6633.645 ms. This seems very slow. What changes can I make to improve the execution time for this query.
The query is selecting 3 columns where the data matches 6 of the columns.
select
platform,
publisher_platform,
adset_id
FROM "adsets"
WHERE
(("adsets"."account_id" IN ('1595321963838425', '1320001405', 'urn:li:sponsoredAccount:507697540')) AND
("adsets"."date" >= '2019-05-06 00:00:00.000000+0000') AND ("adsets"."date" <= '2019-05-13 23:59:59.999999+0000'))
GROUP BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id"
ORDER BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id";
The query is based on a table called adset table. The table has the following columns
account_id | text
campaign_id | text
adset_id | text
name | text
date | timestamp without time zone
publisher_platform | text
and 15 other columns which are a mix of integers and text fields.
We have added the following indexes -
"adsets_composite_unique_key" UNIQUE CONSTRAINT, btree (platform, account_id, campaign_id, adset_id, date, publisher_platform)
"adsets_account_id_date_idx" btree (account_id DESC, date DESC) CLUSTER
"adsets_account_id_index" btree (account_id)
"adsets_adset_id_index" btree (adset_id)
"adsets_campaign_id_index" btree (campaign_id)
"adsets_name_index" btree (name)
"adsets_platform_platform_id_publisher_platform" btree (account_id, platform, publisher_platform, adset_id)
"idx_account_date_adsets" btree (account_id, date)
"platform_pub_index" btree (platform, publisher_platform, adset_id).
The work_mem on postgres has been set to 125MB
Explain (analyse) shows
Group (cost=33447.55..33532.22 rows=8437 width=29) (actual time=6625.170..6633.062 rows=2807 loops=1)
Group Key: platform, publisher_platform, adset_id
-> Sort (cost=33447.55..33468.72 rows=8467 width=29) (actual time=6625.168..6629.271 rows=22331 loops=1)
Sort Key: platform, publisher_platform, adset_id
Sort Method: quicksort Memory: 2513kB
-> Bitmap Heap Scan on adsets (cost=433.63..32895.18 rows=8467 width=29) (actual time=40.003..6471.898 rows=22331 loops=1)
Recheck Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date <= '
2019-05-13 23:59:59.999999'::timestamp without time zone))
Heap Blocks: exact=52907
-> Bitmap Index Scan on idx_account_date_adsets (cost=0.00..431.51 rows=8467 width=0) (actual time=27.335..27.335 rows=75102 loops=1)
Index Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date
<= '2019-05-13 23:59:59.999999'::timestamp without time zone))
Planning time: 5.380 ms
Execution time: 6633.645 ms
(12 rows)
Explain depesz
First of all, you are using GROUP BY without actually selecting any aggregates. You might as well just do SELECT DISTINCT in your query. This aside, here is the B tree index which you probably should be using:
CREATE INDEX idx ON adsets (account_id, date, platform, publisher_platform,
adset_id);
The problem with your current index is that, while it does cover the columns you are selecting, it does not involve the columns which appear in the WHERE clause. This means that Postgres might choose to not even use the index, and rather just scan the entire table.
Note that my suggestion still does nothing to deal with the select distinct portion of the query, but at least it might speed up everything which comes before that part of the query.
Here is your updated query:
SELECT DISTINCT
platform,
publisher_platform,
adset_id
FROM adsets
WHERE
account_id IN ('1595321963838425', '1320001405',
'urn:li:sponsoredAccount:507697540') AND
date >= '2019-05-06' AND date < '2019-05-14';
Your problem are the many “false positives” that are found during the bitmap index scan phase and removed during the heap scan phase. Since there is no additional filter, I guess that the extra rows must be removed because they are not visible.
See if a VACUUM adsets will improve the query performance.
Related
I try to improve query performances on a big (500M rows) time partitioned table. Here is the simplified table structure:
CREATE TABLE execution (
start_time TIMESTAMP WITH TIME ZONE NOT NULL,
end_time TIMESTAMP WITH TIME ZONE,
restriction_criteria VARCHAR(36) NOT NULL
PARTITION BY RANGE (start_time);
Time partitioning
is based on the start_time column because the end_time value is not known when the row is created.
is used to implement efficiently the retention policy.
Request looks like to this generic pattern
SELECT *
FROM execution
WHERE start_time BETWEEN :from AND start_time :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY;
I've got the "best" performances using this index
CREATE INDEX IF NOT EXISTS end_time_desc_start_time_index ON execution USING btree (end_time DESC, start_time);
Yet, performances are not good enough.
Limit (cost=1303.21..27189.31 rows=20 width=1674) (actual time=6791.191..6791.198 rows=20 loops=1)
-> Incremental Sort (cost=1303.21..250693964.74 rows=193689 width=1674) (actual time=6791.189..6791.194 rows=20 loops=1)
" Sort Key: execution.end_time DESC, execution.id"
Presorted Key: execution.end_time
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 64kB Peak Memory: 64kB
-> Merge Append (cost=8.93..250685248.74 rows=193689 width=1674) (actual time=4082.161..6791.047 rows=21 loops=1)
Sort Key: execution.end_time DESC
Subplans Removed: 15
-> Index Scan using execution_2021_10_end_time_start_time_idx on execution_2021_10 execution_1 (cost=0.56..113448316.66 rows=93103 width=1674) (actual time=578.896..578.896 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 734
-> Index Scan using execution_2021_11_end_time_start_time_idx on execution_2021_11 execution_2 (cost=0.56..113653576.54 rows=87605 width=1674) (actual time=116.841..116.841 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 200
-> Index Scan using execution_2021_12_end_time_start_time_idx on execution_2021_12 execution_3 (cost=0.56..16367185.18 rows=12966 width=1674) (actual time=3386.416..6095.261 rows=21 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 5934
Planning Time: 4.108 ms
Execution Time: 6791.317 ms
The index Filter looks is very slow.
I set up a multi-column index hoping the filtering would be done in the Index cond. But it doesn't work
CREATE INDEX IF NOT EXISTS pagination_index ON execution USING btree (end_time DESC, start_time, restriction_criteria);
My feeling is that the first index column should be end_time because we want to leverage the btree index sorting capability. The second one should be restriction_criteria so that an index cond filters rows which doesn't match the restriction_criteria. However, this doesn't work because the query planner need to also check the start_time clause.
The alternative I imagine is to get rid of the partitioning because a multi-column end_time, restriction_critera index would work just fine.
Yet, this is not a perfect solution because dealing with our retention policy would become a pain.
Is there another alternative allowing to keep the start_time partitioning ?
I set up a multi-column index hoping the filtering would be done in the Index cond
The index machinery is very circumspect about what code it runs inside the index. It won't call any operators that it doesn't 'trust', because if the operator throws an error then the whole query will error out, possibly due to rows that weren't even user 'visible' in the first place (i.e. ones that were already deleted or created but never committed). No one wants that. Now the =ANY construct could be considered trustable, but it is not. That means it won't be applied in the Index Cond, but must be applied against the table row, which in turn means you need to visit the table, which is probably where all your time is going, visiting random table rows.
I don't know what it would take code-wise to make =ANY trusted. I've made efforts to investigate that in the past but really never got anywhere, the code around the ANY is too complicated for me to grasp. That would be a nice improvement for the future, but won't help you now anyway.
One way around this is to get an index-only scan. At that point it will call arbitrary code in the index, as it already knows the tuple is visible. But it won't do that for you, because you are selecting at least one column not in the index (and also not shown in your CREATE command, but obviously present anyway)
If you create an index like your widest one but adding "id" to the end, and only select from among those columns, then you should be get a much faster index only scans with merge appends.
If you have even more columns than the ones you've shown plus "id", and you really need to select those columns, and don't want to add all of them to the index, then you can use a trick to use an index-only scan anyway by doing a dummy self join:
with t as (SELECT id
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY
)
select real.* from execution real join t using (id)
ORDER BY end_time DESC, id
(If "id" is not unique, then you might need to join on additional column. Also, you would need an index on "id", which you probably already have)
This one will still need to visit the table to fetch the extra columns, but only for the 20 rows being returned, not for all the ones failing the restriction_criteria.
If the restriction_criteria is very selective, another approach might be better: an index on or leading with that column. It will need to read and sort all of those rows (in the relevant partitions) before applying the LIMIT, but if it is very selective this will not take long.
While you can have the output sorted if the leading column is end_time you can reduce the amount of data processed if you use start_time as a leading column.
Since your filter in start_time and restriction_criteria, is excluding ~7000 rows in order to retrieve 20, maybe speeding up the filtering is more important that speeding up the sorting.
CREATE INDEX IF NOT EXISTS execution_start_time_restriction_idx
ON execution USING btree (start_time, restriction_criteria);
CREATE INDEX IF NOT EXISTS execution_restriction_start_time_idx
ON execution USING btree (restriction_criteria, start_time);
ANALYZE execution
If
FROM execution
WHERE start_time BETWEEN :from AND start_time :to
AND restriction_criteria IN ('123', '456')
Is more than the number of rows removed by the filter then having the `end_time as the leading column might be a good idea. But the planner should be able to figure that out for you.
In the end if some of those indices are not used you can drop it.
I was wondering if my indexes are working well since I am using nodejs and the dates with microseconds are not allowed in this language. So in my query for some good comparison I am doing this kind of thing:
`WHERE (created_at::timestamp(0), uuid) < (${createdAt}::timestamp(0), ${uuid})`
Since I am using a the cast which truncate to seconds, I supposed that the indexes are break. Am I right ? The solution then would be to change the precision of the timestamps stored, or is there another solution to keep the old ones ?
You could change the PostgreSQL data type to millisecond precision:
ALTER TABLE tab ALTER created_at TYPE timestamp(3) without time zone;
By using the recommended EXPLAIN(ANALYZE, VERBOSE, BUFFERS).
I created a table named users with a constraint on the created_at
create table users (
id uuid default uuid_generate_v4() not null
constraint users_pkey primary key,
created_at timestamp default CURRENT_TIMESTAMP
);
create index users_created_at_idx on users (created_at);
The test:
EXPLAIN(ANALYZE, VERBOSE, BUFFERS)
SELECT id
FROM users
WHERE (created_at >= '2022-01-21 15:43:33.631779');
Index Scan using users_created_at_idx on public.users (cost=0.14..4.16 rows=1 width=16) (actual time=0.010..0.018 rows=0 loops=1)
Output: id
Index Cond: (users.created_at >= '2022-01-21 15:43:33.631779'::timestamp without time zone)
Buffers: shared hit=1
Planning Time: 0.074 ms
Execution Time: 0.058 ms
EXPLAIN(ANALYZE, VERBOSE, BUFFERS)
SELECT id
FROM users
WHERE (created_at::timestamp(0) >= '2022-01-21 15:43:33.631779'::timestamp(0));
Seq Scan on public.users (cost=0.00..4.50 rows=33 width=16) (actual time=0.034..0.043 rows=0 loops=1)
Output: id
Filter: ((users.created_at)::timestamp(0) without time zone >= '2022-01-21 15:43:34'::timestamp(0) without time zone)
Rows Removed by Filter: 100
Buffers: shared hit=3
Planning Time: 0.073 ms
Execution Time: 0.089 ms
As we can see the index on the created_at column is not taken into account when we cast and truncate.
I have less than 200 partitions(Daily partitions) and each partition with 5M+ records.
When I pass one day data with direct partition I see estimated plan 0.01ms but while using parent table 190ms(too much). Only difference observed is Append in plan.
Can we eliminate Append or reduce pruning time in postgres 11?
QUERY:
explain (ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING,SUMMARY) select 1 from test WHERE date1 >'2021-01-27 13:41:26' and date1<'2021-01-27 21:41:26' and own=123 and mob=123454234
----------------------------plan-----------
Append (cost=0.12..4.19 rows=1 width=4) (actual time=0.018..0.018 rows=0 loops=1)
Buffers: shared hit=1
-> Index Only Scan using test_20210127_pkey on test_20210127 (cost=0.12..4.17 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=1)
Output: 1
Index Cond: ((test_20210127.date1 > '2021-01-27 13:41:26'::timestamp without time zone) AND (test_20210127.date1 < '2021-01-27 21:41:26'::timestamp without time zone) AND (test_20210127.own = 123) AND (test_20210127.mob = 123454234))
Heap Fetches: 0
Buffers: shared hit=1
Planning Time: 190.440 ms
Execution Time: 0.093 ms
------------Snipped table structure----
CREATE TABLE public.test
(
own integer NOT NULL,
mob bigint NOT NULL,
date1 timestamp without time zone NOT NULL,
ver integer NOT NULL,
c5
...
c100
CONSTRAINT test_pkey PRIMARY KEY (date1, own, mob, ver)
USING INDEX TABLESPACE tb_1
) PARTITION BY RANGE (date1)
WITH (
OIDS = FALSE
)
TABLESPACE tb_1;
-- Partitions SQL
CREATE TABLE public.test_20211003 PARTITION OF public.test
FOR VALUES FROM ('2020-10-03 00:00:00') TO ('2020-10-04 00:00:00');
CREATE TABLE public.test_201004 PARTITION OF public.test
FOR VALUES FROM ('2020-10-04 00:00:00') TO ('2020-10-05 00:00:00');
........6 months partitions
You can upgrade to a later PostgreSQL version, as there were performance improvements in v12.
But if query execution time is short, planning time will always dominate. You can test a prepared statement, but I doubt that runtime partition pruning will be so much faster.
Essentially, the worse query performance is the expected price you are paying for the benefit of a simple way to discard old data.
I have 20 million Record in table its Schema is like below
FieldName Datatype
id bigint(Auto Inc,Primarykey)
name varchar(255)
phone varchar(255)
deleted_at timestamp
created_at timestamp
updated_at timestamp
It has index on name and phone column
Column Index type
name GIN trgm index
phone btree index, GIN trgm index
Created index using the following commands
CREATE INDEX btree_idx ON contacts USING btree (phone);
CREATE INDEX trgm_idx ON contacts USING GIN (phone gin_trgm_ops);
CREATE INDEX trgm_idx_name ON contacts USING GIN (name gin_trgm_ops);
I am running the below query
select * from contacts where phone like '%6666666%' limit 15;
I am doing contains query on phone. The above query takes more than 5 min to get a result. Let me provide the explain statement of this.
explain analyse select * from contacts where phone like '%6666666%' limit 15;
Limit (cost=1774.88..1830.57 rows=15 width=65) (actual time=7970.553..203001.985 rows=15 loops=1)
-> Bitmap Heap Scan on contacts (cost=1774.88..10819.13 rows=2436 width=65) (actual time=7970.552..203001.967 rows=15 loops=1)
Recheck Cond: ((phone)::text ~~ '%6666666%'::text)
Rows Removed by Index Recheck: 254869
Heap Blocks: lossy=2819
-> Bitmap Index Scan on trgm_idx (cost=0.00..1774.27 rows=2436 width=0) (actual time=6720.978..6720.978 rows=306226 loops=1)
Index Cond: ((phone)::text ~~ '%6666666%'::text)
Planning Time: 0.139 ms
Execution Time: 203002.791 ms
Here what can I do to optimize my query? and bring the result under 5 sec would be optimal
One cause of the bad performance is probably
Heap Blocks: lossy=2819
Your work_mem setting is to small to contain a bitmap with one bit per table row, so PostgreSQL degrades it to one bit per 8kB block. This leads to many more rechecks than necessary.
Also, your test is bad. The search string contains only the trigram 666, which will match many rows that don't satisfy the query and have to be removed during recheck. A trigram index is not effective in this pathological case. Test with a number that contains more digits.
I have a table with 36.64 million entries.
The table definition as follow:
id integer, PK
attribute, varchar 255
value, varchar 255
store_id, integer
timestamp, timestamp without timezone
mac_address, varchar 255
plus, mac_address and timestamp column has index.
the query:
select count(*) from table where mac_address = $1 and timestamp between $2 and $3
select * from table where mac_address = $1 and timestamp between $2 and $3
If I run this in pgAdmin, it took a total of 10 seconds.
If I run this using JPA, it took more than 40 seconds. There is no EAGER loading.
I've look into SimpleJpaRepository code. it is exactly these two query, a count() and a getResultList()
questions:
1. looks like timestamp index is not used in both pgAdmin and JPA. I've checked this with ANALYZE and EXPLAIN. But why?
2. Why does JPA needs 10x more time? ORM adds overhead, but 10 times?
3. How do I improve it?
EDIT 1:
Maybe the count() from JPA is not using index scan, it use sequential = slow. my postgresql version is 9.5.
EDIT 2:
in JPA, it is using setFirstResult() and setMaxResult() to get a total of 100 entries. From total of 259242
I try to mimic it with LIMIT and OFFSET, but I didn't see these keywords in JPA query. Maybe JPA is getting all result and then do paging in memory, which in turns cause performance issue?
The first execute of count() query takes 19 to 55 seconds using pgAdmin.
The EXPLAIN of the two query.
count()
Aggregate (cost=761166.10..761166.11 rows=1 width=4) (actual time=1273.871..1273.871 rows=1 loops=1)
Output: count(id)
Buffers: shared read=92986 written=56
-> Bitmap Heap Scan on public.device_messages playerstat0_ (cost=11165.36..760309.47 rows=342650 width=4) (actual time=76.217..1258.389 rows=259242 loops=1)
Output: id, attributecode, attributevalue, store_id, "timestamp", mac_address
Recheck Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
Rows Removed by Index Recheck: 6281401
Heap Blocks: exact=36622 lossy=55083
Buffers: shared read=92986 written=56
-> Bitmap Index Scan on device_messages_mac_address_timestamp_idx (cost=0.00..11079.70 rows=342650 width=0) (actual time=69.636..69.636 rows=259242 loops=1)
Index Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
Buffers: shared read=1281
Planning time: 0.138 ms
Execution time: 1274.275 ms
select
Limit (cost=3362.52..5043.49 rows=100 width=34) (actual time=30.291..42.846 rows=100 loops=1)
Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
Buffers: shared hit=15447 read=1676"
-> Index Scan Backward using device_messages_pkey on public.device_messages playerstat0_ (cost=0.57..5759855.56 rows=342650 width=34) (actual time=2.597..42.834 rows=300 loops=1)
Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
Filter: ((playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone) AND ((playerstat0_.mac_address)::text = '0011E004CA34'::text))
Rows Removed by Filter: 154833
Buffers: shared hit=15447 read=1676
Planning time: 0.180 ms
Execution time: 42.878 ms
EDIT 3:
After more testing, it is confirmed that the cause is count(). select with limit and offset is pretty fast. The count() alone could take up to a minute.
mentioned here postgresql slow counting
While the count estimate function works (ROWS from query plan), I couldn't call that from JPA.
EDIT 3:
I kinda solve the problem, but not completely.
About select, after creating index which matches the query, it actually runs quite fast, 2~5 seconds. But that is without sorting. Sorting adds another process step to the query.
The count() is slow, and is confirmed by postgresql document. the MVCC force count() to do a heap scan, similar to sequence scan to the whole table.
The final problem which I still not sure it that the query on production server is mush slower than testing server. 60 seconds on production and 5 seconds on testing server. With same table size and data. But the big difference is production server has about 20+ insert operation per second. Testing server has no insert operation going on. I am guessing maybe the insert operation needs a write lock and so the query is slow because it has to wait for the lock?
You should be able to get better performance with an index of both mac_address and timestamp in the same index:
CREATE INDEX [CONCURRENTLY] ON table (mac_address, timestamp);
The reason the timestamp index is not used is because it would need to cross reference it with the mac_address index to find the correct rows (which would actually take longer than just looking up the rows directly)
I have no experience with JPA so I can't really say why it's slower.