I've created a PostgreSQL array column with a GIN index, and I'm trying to run a contains query on that column. With standard PostgreSQL I can get that working correctly like this:
SELECT d.name
FROM deck d
WHERE d.card_names_array @> string_to_array('"Anger#2","Pingle Who Annoys#2"', ',')
LIMIT 20;
Explain analyze:
Limit (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
-> Bitmap Heap Scan on deck deck0_ (cost=1720.01..1724.02 rows=1 width=31) (actual time=7.787..7.787 rows=0 loops=1)
Recheck Cond: (card_names_array @> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
-> Bitmap Index Scan on deck_card_names_array_idx (cost=0.00..1720.01 rows=1 width=0) (actual time=7.785..7.785 rows=0 loops=1)
Index Cond: (card_names_array @> '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Planning time: 0.216 ms
Execution time: 7.810 ms
Unfortunately (in this instance) I'm using QueryDSL, with which, as I've read, native array operators like @> are impossible to use. However, this answer says you can use arraycontains instead to do the same thing. I got that working, and it returns the correct results, but it doesn't use my index.
SELECT d.name
FROM deck d
WHERE arraycontains(d.card_names_array, string_to_array('"Anger#2","Pingle Who Annoys#2"', ','))=true
LIMIT 20;
Explain analyze:
Limit (cost=0.00..18.83 rows=20 width=31) (actual time=1036.151..1036.151 rows=0 loops=1)
-> Seq Scan on deck deck0_ (cost=0.00..159065.60 rows=168976 width=31) (actual time=1036.150..1036.150 rows=0 loops=1)
Filter: arraycontains(card_names_array, '{"\"Anger#2\"","\"Pingle Who Annoys#2\""}'::text[])
Rows Removed by Filter: 584014
Planning time: 0.204 ms
Execution time: 1036.166 ms
This is my QueryDSL code to create the boolean expression:
predicate.and(Expressions.booleanTemplate(
"arraycontains({0}, string_to_array({1}, ','))=true",
deckQ.cardNamesArray,
filters.cards.joinToString(",") { "${it.cardName}#${it.quantity}" }
))
Is there some way to get it to use my index? Or maybe a different way to do this with QueryDSL so it uses the native @> operator?
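One possible direction, as a hedged sketch only: arraycontains() is an ordinary function, so the planner cannot connect it to the GIN operator class, whereas the @> operator is exactly what the index supports. PostgreSQL inlines trivial LANGUAGE sql functions into the calling query, so a thin wrapper around the operator (the function name array_contains_gin below is invented) might let a booleanTemplate keep using function syntax while the planner still sees @>:
-- Hypothetical wrapper; because simple SQL functions are inlined into the
-- calling query, the planner sees the raw @> operator and can consider the
-- GIN index.
CREATE OR REPLACE FUNCTION array_contains_gin(haystack text[], needles text[])
RETURNS boolean AS
$$ SELECT haystack @> needles $$
LANGUAGE sql IMMUTABLE;
The template would then call array_contains_gin({0}, string_to_array({1}, ',')) instead of arraycontains(...). Whether the inlining survives the exact expression QueryDSL generates is something to verify with EXPLAIN.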
I have the following two tables.
person_addresses
address_normalization
The person_addresses table has a field named address_id as its primary key, and address_normalization has a corresponding address_id field, which has an index on it.
Now, when I explain the following query, I see a sequential scan.
SELECT
count(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1550692189468;
-- Result: 2654
Please refer to the execution plan below (originally attached as a screenshot; the full text plan is in Update #2).
You can see that there is a sequential scan under the hash join. I'm not sure I understand this part: why would a sequential scan follow a hash join?
And as seen in the query above, the number of records returned is also low.
Is this expected behaviour or am I doing something wrong?
Update #1: I also have indices on the sr_modification_time fields of both tables
Update #2: Full execution plan
Aggregate (cost=206944.74..206944.75 rows=1 width=0) (actual time=2807.844..2807.844 rows=1 loops=1)
Buffers: shared hit=4629 read=82217
-> Hash Join (cost=2881.95..206825.15 rows=47836 width=0) (actual time=0.775..2807.160 rows=2654 loops=1)
Hash Cond: (pa.address_id = an.address_id)
Buffers: shared hit=4629 read=82217
-> Seq Scan on person_addresses pa (cost=0.00..135924.93 rows=4911993 width=8) (actual time=0.005..1374.610 rows=4911993 loops=1)
Buffers: shared hit=4588 read=82217
-> Hash (cost=2432.05..2432.05 rows=35992 width=18) (actual time=0.756..0.756 rows=1005 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 41kB
Buffers: shared hit=41
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..2432.05 rows=35992 width=18) (actual time=0.012..0.424 rows=1005 loops=1)
Index Cond: (sr_modification_time >= 1550692189468::bigint)
Buffers: shared hit=41
Planning time: 0.244 ms
Execution time: 2807.885 ms
Update #3: I tried with a newer timestamp and it used an index scan.
EXPLAIN (
ANALYZE
, buffers
, format TEXT
) SELECT
COUNT(*)
FROM
mp_member2.person_addresses pa
JOIN mp_member2.address_normalization an ON
an.address_id = pa.address_id
WHERE
an.sr_modification_time >= 1557507300342;
-- count: 1364
Query Plan:
Aggregate (cost=295.48..295.49 rows=1 width=0) (actual time=2.770..2.770 rows=1 loops=1)
Buffers: shared hit=1404
-> Nested Loop (cost=4.89..295.43 rows=19 width=0) (actual time=0.038..2.491 rows=1364 loops=1)
Buffers: shared hit=1404
-> Index Scan using mp_member2_address_normalization_mod_time on address_normalization an (cost=0.43..8.82 rows=14 width=18) (actual time=0.009..0.142 rows=341 loops=1)
Index Cond: (sr_modification_time >= 1557507300342::bigint)
Buffers: shared hit=14
-> Bitmap Heap Scan on person_addresses pa (cost=4.46..20.43 rows=4 width=8) (actual time=0.004..0.005 rows=4 loops=341)
Recheck Cond: (address_id = an.address_id)
Heap Blocks: exact=360
Buffers: shared hit=1390
-> Bitmap Index Scan on idx_mp_member2_person_addresses_address_id (cost=0.00..4.46 rows=4 width=0) (actual time=0.003..0.003 rows=4 loops=341)
Index Cond: (address_id = an.address_id)
Buffers: shared hit=1030
Planning time: 0.214 ms
Execution time: 2.816 ms
That is the expected behaviour: there is no index that covers both the join key and sr_modification_time, so after building the hash the database has to scan the whole person_addresses set to check each row for a match.
You should create:
an index on (sr_modification_time)
or a composite index on (address_id, sr_modification_time), as sketched below
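In SQL, that suggestion would look something like this (a sketch; schema and table names taken from the question):
-- single-column index on the filter column
CREATE INDEX ON mp_member2.address_normalization (sr_modification_time);
-- or the composite variant covering the join key and the filter together
CREATE INDEX ON mp_member2.address_normalization (address_id, sr_modification_time);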
I have a table partitioned for every quarter. The table name is data. The table has a couple of columns, including date, which has an index created on it:
create index on data (date);
Now I am querying the table:
justpremium=> EXPLAIN analyze SELECT sum(col_1) FROM data WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=355709.66..355709.67 rows=1 width=32) (actual time=577.072..577.072 rows=1 loops=1)
-> Gather (cost=355709.44..355709.65 rows=2 width=32) (actual time=577.005..578.418 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=354709.44..354709.45 rows=1 width=32) (actual time=573.255..573.256 rows=1 loops=3)
-> Append (cost=0.42..352031.07 rows=1071346 width=8) (actual time=15.286..524.604 rows=837204 loops=3)
-> Parallel Index Scan using data_date_idx on data (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3)
Index Cond: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
-> Parallel Seq Scan on data_y2018q4 (cost=0.00..352022.64 rows=1071345 width=8) (actual time=15.282..465.859 rows=837204 loops=3)
Filter: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
Rows Removed by Filter: 1479844
Planning time: 1.437 ms
Execution time: 578.465 ms
(13 rows)
We can see that there is a Parallel Seq Scan on data_y2018q4. In fact that seems normal to me: I have one quarterly partition and I am querying about a third of it, so a seq scan is expected.
But now let's query the partition table directly:
justpremium=> EXPLAIN analyze SELECT sum(col_1) FROM data_y2018q4 WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=286475.38..286475.39 rows=1 width=32) (actual time=277.830..277.830 rows=1 loops=1)
-> Gather (cost=286475.16..286475.37 rows=2 width=32) (actual time=277.760..279.194 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=285475.16..285475.17 rows=1 width=32) (actual time=275.950..275.950 rows=1 loops=3)
-> Parallel Index Scan using data_y2018q4_date_idx on data_y2018q4 (cost=0.43..282796.80 rows=1071345 width=8) (actual time=0.022..227.687 rows=837204 loops=3)
Index Cond: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
Planning time: 0.187 ms
Execution time: 279.233 ms
(9 rows)
Now I get an Index Scan using data_y2018q4_date_idx, and the whole query is twice as fast: 279.233 ms compared to 578.465 ms. What is the explanation for this? How can I force the planner to use the index scan when querying the data table, and so achieve the two-times-better timing?
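One way to investigate, as a diagnostic sketch rather than a fix: temporarily disabling sequential scans forces the planner to price the index plan, which shows whether it can reach the partition's index through the parent table at all, and what it estimates that plan would cost:
BEGIN;
SET LOCAL enable_seqscan = off;  -- affects only this transaction
EXPLAIN ANALYZE
SELECT sum(col_1) FROM data WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';
ROLLBACK;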
I have a complex query generated by Hibernate for jBPM. I can't really modify it, and I'm trying to optimize it as much as possible.
I found out that ORDER BY DESC is way slower than ORDER BY ASC. Do you have any idea why?
PostgreSQL version: 9.4
Schema: https://pastebin.com/qNZhrbef
Query:
select
taskinstan0_.ID_ as ID1_27_,
taskinstan0_.VERSION_ as VERSION3_27_,
taskinstan0_.NAME_ as NAME4_27_,
taskinstan0_.DESCRIPTION_ as DESCRIPT5_27_,
taskinstan0_.ACTORID_ as ACTORID6_27_,
taskinstan0_.CREATE_ as CREATE7_27_,
taskinstan0_.START_ as START8_27_,
taskinstan0_.END_ as END9_27_,
taskinstan0_.DUEDATE_ as DUEDATE10_27_,
taskinstan0_.PRIORITY_ as PRIORITY11_27_,
taskinstan0_.ISCANCELLED_ as ISCANCE12_27_,
taskinstan0_.ISSUSPENDED_ as ISSUSPE13_27_,
taskinstan0_.ISOPEN_ as ISOPEN14_27_,
taskinstan0_.ISSIGNALLING_ as ISSIGNA15_27_,
taskinstan0_.ISBLOCKING_ as ISBLOCKING16_27_,
taskinstan0_.LOCKED as LOCKED27_,
taskinstan0_.QUEUE as QUEUE27_,
taskinstan0_.TASK_ as TASK19_27_,
taskinstan0_.TOKEN_ as TOKEN20_27_,
taskinstan0_.PROCINST_ as PROCINST21_27_,
taskinstan0_.SWIMLANINSTANCE_ as SWIMLAN22_27_,
taskinstan0_.TASKMGMTINSTANCE_ as TASKMGM23_27_
from JBPM_TASKINSTANCE taskinstan0_, JBPM_VARIABLEINSTANCE stringinst1_, JBPM_PROCESSINSTANCE processins2_, JBPM_VARIABLEINSTANCE variablein3_
where stringinst1_.CLASS_='S'
and taskinstan0_.PROCINST_=processins2_.ID_
and taskinstan0_.ID_=variablein3_.TASKINSTANCE_
and variablein3_.NAME_ = 'NIR'
and taskinstan0_.QUEUE = 'ERT_TPS'
and (processins2_.ORGAPATH_ like '/ERT%')
and taskinstan0_.ISOPEN_= 't'
and variablein3_.ID_=stringinst1_.ID_
order by stringinst1_.STRINGVALUE_ ASC limit '10';
Explain result for ASC:
Limit (cost=1.71..11652.93 rows=10 width=646) (actual time=6.588..82.407 rows=10 loops=1)
-> Nested Loop (cost=1.71..6215929.27 rows=5335 width=646) (actual time=6.587..82.402 rows=10 loops=1)
-> Nested Loop (cost=1.29..6213170.78 rows=5335 width=646) (actual time=6.578..82.363 rows=10 loops=1)
-> Nested Loop (cost=1.00..6159814.66 rows=153812 width=13) (actual time=0.537..82.130 rows=149 loops=1)
-> Index Scan Backward using totoidx10 on jbpm_variableinstance stringinst1_ (cost=0.56..558481.07 rows=11199905 width=13) (actual time=0.018..11.914 rows=40182 loops=1)
Filter: (class_ = 'S'::bpchar)
-> Index Scan using jbpm_variableinstance_pkey on jbpm_variableinstance variablein3_ (cost=0.43..0.49 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=40182)
Index Cond: (id_ = stringinst1_.id_)
Filter: ((name_)::text = 'NIR'::text)
Rows Removed by Filter: 1
-> Index Scan using jbpm_taskinstance_pkey on jbpm_taskinstance taskinstan0_ (cost=0.29..0.34 rows=1 width=641) (actual time=0.001..0.001 rows=0 loops=149)
Index Cond: (id_ = variablein3_.taskinstance_)
Filter: (isopen_ AND ((queue)::text = 'ERT_TPS'::text))
Rows Removed by Filter: 0
-> Index Only Scan using idx_procin_2 on jbpm_processinstance processins2_ (cost=0.42..0.51 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=10)
Index Cond: (id_ = taskinstan0_.procinst_)
Filter: ((orgapath_)::text ~~ '/ERT%'::text)
Heap Fetches: 0
Planning time: 2.598 ms
Execution time: 82.513 ms
Explain result for DESC:
Limit (cost=1.71..11652.93 rows=10 width=646) (actual time=8144.871..8144.986 rows=10 loops=1)
-> Nested Loop (cost=1.71..6215929.27 rows=5335 width=646) (actual time=8144.870..8144.984 rows=10 loops=1)
-> Nested Loop (cost=1.29..6213170.78 rows=5335 width=646) (actual time=8144.858..8144.951 rows=10 loops=1)
-> Nested Loop (cost=1.00..6159814.66 rows=153812 width=13) (actual time=8144.838..8144.910 rows=20 loops=1)
-> Index Scan using totoidx10 on jbpm_variableinstance stringinst1_ (cost=0.56..558481.07 rows=11199905 width=13) (actual time=0.066..2351.727 rows=2619671 loops=1)
Filter: (class_ = 'S'::bpchar)
Rows Removed by Filter: 906237
-> Index Scan using jbpm_variableinstance_pkey on jbpm_variableinstance variablein3_ (cost=0.43..0.49 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=2619671)
Index Cond: (id_ = stringinst1_.id_)
Filter: ((name_)::text = 'NIR'::text)
Rows Removed by Filter: 1
-> Index Scan using jbpm_taskinstance_pkey on jbpm_taskinstance taskinstan0_ (cost=0.29..0.34 rows=1 width=641) (actual time=0.002..0.002 rows=0 loops=20)
Index Cond: (id_ = variablein3_.taskinstance_)
Filter: (isopen_ AND ((queue)::text = 'ERT_TPS'::text))
-> Index Only Scan using idx_procin_2 on jbpm_processinstance processins2_ (cost=0.42..0.51 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=10)
Index Cond: (id_ = taskinstan0_.procinst_)
Filter: ((orgapath_)::text ~~ '/ERT%'::text)
Heap Fetches: 0
Planning time: 2.080 ms
Execution time: 8145.053 ms
Table info:
jbpm_variableinstance 12100592 rows
jbpm_taskinstance 69913 rows
jbpm_processinstance 97546 rows
If you have any ideas, thanks!
This typically only happens when OFFSET and / or LIMIT are involved (as is the case here).
The key difference is this line in the EXPLAIN output for the query with DESC:
Rows Removed by Filter: 906237
Meaning that while the first 10 rows in the index totoidx10 match when scanning backwards (which matches your ASC ordering, obviously), Postgres has to filter ~ 900k rows before it finally finds qualifying rows when scanning the same index forward.
A matching multicolumn index (with the right sort order) might help a lot.
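For example, something along these lines (a sketch only; the index name is invented, and the exact column choice depends on the rest of the workload):
-- With the equality-filtered column first and the sort column second, a scan
-- in either direction returns rows already ordered by stringvalue_, so the
-- DESC case no longer has to wade through non-'S' rows first.
CREATE INDEX jbpm_varinst_class_stringvalue_idx
    ON jbpm_variableinstance (class_, stringvalue_);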
Or, since Postgres chooses an unfavorable query plan, updated (or more detailed) table statistics or adjusted cost settings may be all it takes.
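The statistics route could be sketched like this (the target 1000 is an arbitrary example; the default is 100):
ALTER TABLE jbpm_variableinstance ALTER COLUMN stringvalue_ SET STATISTICS 1000;
ANALYZE jbpm_variableinstance;  -- resample with the larger statistics target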
Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Optimizing queries on a range of timestamps (two columns)
I have the following query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users
on users.id = person_dimensions.user_id
where users.team_id = 2
The following is the result of EXPLAIN ANALYZE:
Nested Loop (cost=0.43..93033.84 rows=452 width=11) (actual time=1245.321..42915.426 rows=827 loops=1)
-> Seq Scan on person_dimensions (cost=0.00..254.72 rows=13772 width=15) (actual time=0.022..9.907 rows=13772 loops=1)
-> Index Scan using users_pkey on users (cost=0.43..6.73 rows=1 width=4) (actual time=2.978..3.114 rows=0 loops=13772)
Index Cond: (id = person_dimensions.user_id)
Filter: (team_id = 2)
Rows Removed by Filter: 1
Planning time: 0.396 ms
Execution time: 42915.678 ms
Indexes exist on person_dimensions.user_id and users.team_id, so it is unclear why this seemingly simple query is taking so long.
Maybe it has something to do with team_id not being usable in the join condition? Any ideas how to speed this up?
EDIT:
I tried this query:
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
join users ON users.id = person_dimensions.user_id
WHERE users.id IN (2337,2654,3501,56,4373,1060,3170,97,4629,41,3175,4541,2827)
which contains the ids returned by the subquery:
SELECT id FROM users WHERE team_id = 2
The result was 380ms versus 42s as above. I could use this as a workaround, but I am really curious as to what is going on here...
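For reference, here are the two steps of that workaround combined into a single statement (no new logic, just the subquery inlined):
SELECT "person_dimensions"."dimension"
FROM "person_dimensions"
JOIN users ON users.id = person_dimensions.user_id
WHERE users.id IN (SELECT id FROM users WHERE team_id = 2);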
I rebooted my DB server yesterday, and when it came back up this same query was performing as expected with a completely different query plan that used expected indices:
QUERY PLAN
Hash Join (cost=1135.63..1443.45 rows=84 width=11) (actual time=0.354..6.312 rows=835 loops=1)
Hash Cond: (person_dimensions.user_id = users.id)
-> Seq Scan on person_dimensions (cost=0.00..255.17 rows=13817 width=15) (actual time=0.002..2.764 rows=13902 loops=1)
-> Hash (cost=1132.96..1132.96 rows=214 width=4) (actual time=0.175..0.175 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Bitmap Heap Scan on users (cost=286.07..1132.96 rows=214 width=4) (actual time=0.032..0.157 rows=60 loops=1)
Recheck Cond: (team_id = 2)
Heap Blocks: exact=68
-> Bitmap Index Scan on index_users_on_team_id (cost=0.00..286.02 rows=214 width=0) (actual time=0.021..0.021 rows=82 loops=1)
Index Cond: (team_id = 2)
Planning time: 0.215 ms
Execution time: 6.474 ms
Does anyone have any ideas why it required a reboot to become aware of all of this? Could it be that manual vacuums that hadn't been done in a while were required, or something like that? Recall that I did run ANALYZE on the relevant tables before the reboot and it didn't change anything.
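For completeness, the manual maintenance speculated about above would be something like this (a sketch; whether it would have avoided the reboot is exactly the open question):
VACUUM ANALYZE person_dimensions;  -- reclaim dead tuples, refresh planner statistics
VACUUM ANALYZE users;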
I have a query like this which uses the index on call_id, but if I add a _ to the value, then it changes from an index scan to a seq scan.
explain analyze
DELETE
FROM completedcalls
WHERE call_id like '560738a563616c6c004c7621#198.148.114.67-b2b1';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Delete on completedcalls (cost=0.00..8.67 rows=1 width=6) (actual time=0.036..0.036 rows=0 loops=1)
-> Index Scan using i_call_id on completedcalls (cost=0.00..8.67 rows=1 width=6) (actual time=0.034..0.034 rows=0 loops=1)
Index Cond: ((call_id)::text = '560738a563616c6c004c7621#198.148.114.67-b2b1'::text)
Filter: ((call_id)::text ~~ '560738a563616c6c004c7621#198.148.114.67-b2b1'::text)
Total runtime: 0.069 ms
(5 rows)
This statement:
explain analyze
DELETE
FROM completedcalls
WHERE call_id like '560738a563616c6c004c7621#198.148.114.67-b2b_1';
Returns this execution plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Delete on completedcalls (cost=0.00..39548.64 rows=84 width=6) (actual time=194.313..194.313 rows=0 loops=1)
-> Seq Scan on completedcalls (cost=0.00..39548.64 rows=84 width=6) (actual time=194.310..194.310 rows=0 loops=1)
Filter: ((call_id)::text ~~ '560738a563616c6c004c7621#198.148.114.67-b2b_1'::text)
Total runtime: 194.349 ms
(4 rows)
My question is: how do I escape these characters in the query? I'm using psycopg2 in Python.
You would need to escape the _ with a backslash, like so:
DELETE FROM completedcalls
WHERE call_id like '560738a563616c6c004c7621#198.148.114.67-b2b\_1';
Also, if you don't want pattern matching, it would probably make more sense to use = instead of LIKE.
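Both variants as sketches: the first spells out the escape character with an ESCAPE clause, the second drops LIKE entirely since the value contains no intentional wildcards; with psycopg2, either literal can then be passed as an ordinary query parameter instead of being interpolated into the SQL string:
-- explicit ESCAPE clause (backslash is also the default escape character)
DELETE FROM completedcalls
WHERE call_id LIKE '560738a563616c6c004c7621#198.148.114.67-b2b\_1' ESCAPE '\';

-- plain equality: no pattern matching, no escaping, and the index applies directly
DELETE FROM completedcalls
WHERE call_id = '560738a563616c6c004c7621#198.148.114.67-b2b_1';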