Why do different query values produce different index algorithms? - postgresql

I have a query, and I created indexes specifically for it. But I just discovered that with certain input values the query stops running fast and effectively does a full scan.
Here is the fast case:
explain analyze SELECT
v.valtr_id,
v.block_num,
v.from_id,
v.to_id,
v.from_balance::text,
v.to_balance::text
FROM value_transfer v
WHERE
(v.block_num<=2748053) AND
(
(v.to_id=639291) OR
(v.from_id=639291)
)
ORDER BY
v.block_num DESC,v.valtr_id DESC
LIMIT 1
 Limit  (cost=23054.03..23054.03 rows=1 width=30) (actual time=1.464..1.465 rows=1 loops=1)
   ->  Sort  (cost=23054.03..23068.94 rows=5964 width=30) (actual time=1.462..1.462 rows=1 loops=1)
         Sort Key: block_num DESC, valtr_id DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on value_transfer v  (cost=144.85..23024.21 rows=5964 width=30) (actual time=1.397..1.437 rows=3 loops=1)
               Recheck Cond: ((to_id = 639291) OR (from_id = 639291))
               Filter: (block_num <= 2748053)
               Heap Blocks: exact=3
               ->  BitmapOr  (cost=144.85..144.85 rows=5964 width=0) (actual time=1.339..1.339 rows=0 loops=1)
                     ->  Bitmap Index Scan on vt_to_id_idx  (cost=0.00..40.42 rows=1580 width=0) (actual time=0.755..0.755 rows=1 loops=1)
                           Index Cond: (to_id = 639291)
                     ->  Bitmap Index Scan on vt_from_id_idx  (cost=0.00..101.45 rows=4384 width=0) (actual time=0.580..0.580 rows=2 loops=1)
                           Index Cond: (from_id = 639291)
 Planning time: 0.499 ms
 Execution time: 1.556 ms
(15 rows)
But if I use the value 199658 as input instead, the planner chooses a different plan:
explain analyze SELECT
v.valtr_id,
v.block_num,
v.from_id,
v.to_id,
v.from_balance::text,
v.to_balance::text
FROM value_transfer v
WHERE
(v.block_num<=2748053) AND
(
(v.to_id=199658) OR
(v.from_id=199658)
)
ORDER BY
v.block_num DESC,v.valtr_id DESC
LIMIT 1 ;
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..6462.99 rows=1 width=30) (actual time=614109.855..614109.856 rows=1 loops=1)
   ->  Index Scan Backward using bnum_valtr_idx on value_transfer v  (cost=0.57..200845479.66 rows=31079 width=30) (actual time=614109.853..614109.853 rows=1 loops=1)
         Index Cond: (block_num <= 2748053)
         Filter: ((to_id = 199658) OR (from_id = 199658))
         Rows Removed by Filter: 101190609
 Planning time: 0.515 ms
 Execution time: 614109.920 ms
(7 rows)
Why is this happening? I thought that once you have created the indexes for your query, execution would always take the same path, but that is not the case. How can I make sure Postgres always uses the same plan for every search?
I even suspected that the indexes hadn't been built cleanly, so I rebuilt the main index:
postgres=> drop index bnum_valtr_idx;
DROP INDEX
postgres=> CREATE INDEX bnum_valtr_idx ON public.value_transfer USING btree (block_num DESC, valtr_id DESC);
CREATE INDEX
However, this didn't change anything.
My table definition is:
CREATE TABLE value_transfer (
valtr_id BIGSERIAL PRIMARY KEY,
tx_id BIGINT REFERENCES transaction(tx_id) ON DELETE CASCADE ON UPDATE CASCADE,
block_id INT REFERENCES block(block_id) ON DELETE CASCADE ON UPDATE CASCADE,
block_num INT NOT NULL,
from_id INT NOT NULL,
to_id INT NOT NULL,
value NUMERIC DEFAULT 0,
from_balance NUMERIC DEFAULT 0,
to_balance NUMERIC DEFAULT 0,
kind CHAR NOT NULL,
depth INT DEFAULT 0,
error TEXT NOT NULL
);
postgres=> SELECT * FROM pg_indexes WHERE tablename = 'value_transfer';
schemaname | tablename | indexname | tablespace | indexdef
------------+----------------+---------------------+------------+--------------------------------------------------------------------------------------------------
public | value_transfer | bnum_valtr_idx | | CREATE INDEX bnum_valtr_idx ON public.value_transfer USING btree (block_num DESC, valtr_id DESC)
public | value_transfer | value_transfer_pkey | | CREATE UNIQUE INDEX value_transfer_pkey ON public.value_transfer USING btree (valtr_id)
public | value_transfer | vt_tx_from_idx | | CREATE INDEX vt_tx_from_idx ON public.value_transfer USING btree (tx_id)
public | value_transfer | vt_block_num_idx | | CREATE INDEX vt_block_num_idx ON public.value_transfer USING btree (block_num)
public | value_transfer | vt_from_id_idx | | CREATE INDEX vt_from_id_idx ON public.value_transfer USING btree (from_id)
public | value_transfer | vt_to_id_idx | | CREATE INDEX vt_to_id_idx ON public.value_transfer USING btree (to_id)
public | value_transfer | vt_block_id_idx | | CREATE INDEX vt_block_id_idx ON public.value_transfer USING btree (block_id)
(7 rows)

It could be that one value occurs mostly in one column and the other value mostly in the other column. Regardless, an OR across different columns is notorious for causing performance problems: no single index can check both columns at once, so the planner either combines two separate indexes with a BitmapOr and then sorts the result (your fast plan), or walks one ordered index and filters out everything that doesn't satisfy the OR (your slow plan), which is catastrophic when matching rows are rare.
The way around this problem is to split the query into a UNION.
Try this:
SELECT * FROM (
SELECT
valtr_id,
block_num,
from_id,
to_id,
from_balance::text,
to_balance::text
FROM value_transfer
WHERE block_num<=2748053
AND to_id=199658
UNION ALL
SELECT
valtr_id,
block_num,
from_id,
to_id,
from_balance::text,
to_balance::text
FROM value_transfer
WHERE block_num<=2748053
AND from_id=199658
) x
ORDER BY block_num DESC, valtr_id DESC
LIMIT 1
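If you also create a matching composite index for each side of the OR, you can go a step further and push the ORDER BY/LIMIT into each branch, so each branch returns at most one row straight from an index. This is only a sketch; the two composite indexes below are assumptions, not indexes that exist in the schema above:
-- Assumed indexes (not part of the original schema):
-- CREATE INDEX vt_to_block_valtr_idx   ON value_transfer (to_id,   block_num DESC, valtr_id DESC);
-- CREATE INDEX vt_from_block_valtr_idx ON value_transfer (from_id, block_num DESC, valtr_id DESC);
SELECT * FROM (
    (SELECT valtr_id, block_num, from_id, to_id,
            from_balance::text, to_balance::text
     FROM value_transfer
     WHERE block_num <= 2748053 AND to_id = 199658
     ORDER BY block_num DESC, valtr_id DESC
     LIMIT 1)
    UNION ALL
    (SELECT valtr_id, block_num, from_id, to_id,
            from_balance::text, to_balance::text
     FROM value_transfer
     WHERE block_num <= 2748053 AND from_id = 199658
     ORDER BY block_num DESC, valtr_id DESC
     LIMIT 1)
) x
ORDER BY block_num DESC, valtr_id DESC
LIMIT 1;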

Related

Why is a MAX query with an equality filter on one other column so slow in Postgresql?

I'm running into an issue in PostgreSQL (version 9.6.10) with indexes not working to speed up a MAX query with a simple equality filter on another column. Logically it seems that a simple multicolumn index on (A, B DESC) should make the query super fast.
I can't for the life of me figure out why I can't get a query to be performant regardless of what indexes are defined.
The table definition has the following:
- A primary key foo VARCHAR PRIMARY KEY (not used in the query)
- A UUID field that is NOT NULL called bar UUID
- A sequential_id column that was created as a BIGSERIAL UNIQUE type
Here's what the relevant columns look like exactly (with names modified for privacy):
Table "public.foo"
Column | Type | Modifiers
----------------------+--------------------------+--------------------------------------------------------------------------------
foo_uid | character varying | not null
bar_uid | uuid | not null
sequential_id | bigint | not null default nextval('foo_sequential_id_seq'::regclass)
Indexes:
"foo_pkey" PRIMARY KEY, btree (foo_uid)
"foo_bar_uid_sequential_id_idx", btree (bar_uid, sequential_id DESC)
"foo_sequential_id_key" UNIQUE CONSTRAINT, btree (sequential_id)
Despite the index listed above on (bar_uid, sequential_id DESC), the following query ends up scanning a different index and takes 100-300 ms with a few million rows in the database.
The Query (get the max sequential_id for a given bar_uid):
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f';
The EXPLAIN ANALYZE result doesn't use the proper index. Also, for some reason it checks if sequential_id IS NOT NULL even though it's declared as not null.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.75..0.76 rows=1 width=8) (actual time=321.110..321.110 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.75 rows=1 width=8) (actual time=321.106..321.106 rows=1 loops=1)
-> Index Scan Backward using foo_sequential_id_key on foo (cost=0.43..98936.43 rows=308401 width=8) (actual time=321.106..321.106 rows=1 loops=1)
Index Cond: (sequential_id IS NOT NULL)
Filter: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Rows Removed by Filter: 920761
Planning time: 0.196 ms
Execution time: 321.127 ms
(9 rows)
I can add a seemingly unnecessary GROUP BY to this query, and that speeds it up a bit, but it's still really slow for a query that should be near instantaneous with indexes defined:
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
GROUP BY bar_uid;
The EXPLAIN (ANALYZE, BUFFERS) result:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8510.54..65953.61 rows=6 width=24) (actual time=234.529..234.530 rows=1 loops=1)
Group Key: bar_uid
Buffers: shared hit=1 read=11909
-> Bitmap Heap Scan on foo (cost=8510.54..64411.55 rows=308401 width=24) (actual time=65.259..201.969 rows=309023 loops=1)
Recheck Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Heap Blocks: exact=10385
Buffers: shared hit=1 read=11909
-> Bitmap Index Scan on foo_bar_uid_sequential_id_idx (cost=0.00..8433.43 rows=308401 width=0) (actual time=63.549..63.549 rows=309023 loops=1)
Index Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Buffers: shared read=1525
Planning time: 3.067 ms
Execution time: 234.589 ms
(12 rows)
Does anyone have any idea what's blocking this query from being on the order of 10 milliseconds? This should logically be instantaneous with the right index defined. It should only require the time to follow links to the leaf value in the B-Tree.
Someone asked:
What do you get for SELECT * FROM pg_stats WHERE tablename = 'foo' and attname = 'bar_uid';?
-[ RECORD 1 ]----------+---------------------------------------------------------------------------
schemaname             | public
tablename              | foo
attname                | bar_uid
inherited              | f
null_frac              | 0
avg_width              | 16
n_distinct             | 6
most_common_vals       | {fa61424d-389f-4e75-ba2d-b77e6bb8491f,5c5dcae9-1b7e-4413-99a1-62fde2b89c32,50b1e842-fc32-4c2c-b00f-4a17c3c1c5fa,7ff1999c-c0ea-b700-343f-9a737f6ad659,f667b353-e199-4890-9ffd-4940ea11fe2c,b24ce968-29fd-4587-ba1f-227036ee3135}
most_common_freqs      | {0.203733,0.203167,0.201567,0.195867,0.1952,0.000466667}
histogram_bounds       |
correlation            | -0.158093
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
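One rewrite that is often worth trying for this kind of MAX-with-filter query (a sketch, not a verified fix for this exact table) expresses the aggregate as ORDER BY ... LIMIT 1, which lines up exactly with the (bar_uid, sequential_id DESC) index:
-- Equivalent to MAX(sequential_id) here because sequential_id is NOT NULL;
-- the existing foo_bar_uid_sequential_id_idx can satisfy both the filter
-- and the ordering, so only one index entry needs to be read.
SELECT sequential_id
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
ORDER BY sequential_id DESC
LIMIT 1;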

slow order by "field" and limit

I have a simple query that must get 1 record from a table with about 14m records:
EXPLAIN ANALYZE SELECT "projects_toolresult"."id",
"projects_toolresult"."tool_id",
"projects_toolresult"."status",
"projects_toolresult"."updated_at",
"projects_toolresult"."created_at" FROM
"projects_toolresult" WHERE
("projects_toolresult"."status" = 1 AND
"projects_toolresult"."tool_id" = 21)
ORDER BY "projects_toolresult"."updated_at"
DESC LIMIT 1;
The weird thing is that when I order the query by the updated_at field, it takes about 60 seconds to execute.
 Limit  (cost=0.43..510.94 rows=1 width=151) (actual time=56754.932..56754.932 rows=0 loops=1)
   ->  Index Scan using projects_to_updated_266459_idx on projects_toolresult  (cost=0.43..1800549.09 rows=3527 width=151) (actual time=56754.930..56754.930 rows=0 loops=1)
         Filter: ((status = 1) AND (tool_id = 21))
         Rows Removed by Filter: 13709343
 Planning time: 0.236 ms
 Execution time: 56754.968 ms
(6 rows)
It doesn't matter whether the order is ASC or DESC.
But if I use ORDER BY random() or no ordering at all:
Limit (cost=23496.10..23496.10 rows=1 width=151) (actual time=447.532..447.532 rows=0 loops=1)
-> Sort (cost=23496.10..23505.20 rows=3642 width=151) (actual time=447.530..447.530 rows=0 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 25kB
-> Index Scan using projects_toolresult_tool_id_34a3bb16 on projects_toolresult (cost=0.56..23477.89 rows=3642 width=151) (actual time=447.513..447.513 rows=0 loops=1)
Index Cond: (tool_id = 21)
Filter: (status = 1)
Rows Removed by Filter: 6097
Planning time: 0.224 ms
Execution time: 447.571 ms
(10 rows)
It works fast.
I have indexes on the updated_at and status fields (I also tried without them). I tuned the default Postgres settings, increasing values with this generator: https://pgtune.leopard.in.ua/#/
Postgres version 9.5
My table and indexes:
id | integer | not null default nextval('projects_toolresult_id_seq'::regclass)
status | smallint | not null
object_id | integer | not null
created_at | timestamp with time zone | not null
content_type_id | integer | not null
tool_id | integer | not null
updated_at | timestamp with time zone | not null
output_data | text | not null
Indexes:
"projects_toolresult_pkey" PRIMARY KEY, btree (id)
"projects_toolresult_content_type_id_object_i_71ee2c2e_uniq" UNIQUE CONSTRAINT, btree (content_type_id, object_id, tool_id)
"projects_to_created_cee389_idx" btree (created_at)
"projects_to_tool_id_ec7856_idx" btree (tool_id, status)
"projects_to_updated_266459_idx" btree (updated_at)
"projects_toolresult_content_type_id_9924d905" btree (content_type_id)
"projects_toolresult_tool_id_34a3bb16" btree (tool_id)
Check constraints:
"projects_toolresult_object_id_check" CHECK (object_id >= 0)
"projects_toolresult_status_check" CHECK (status >= 0)
Foreign-key constraints:
"projects_toolresult_content_type_id_9924d905_fk_django_co" FOREIGN KEY (content_type_id) REFERENCES django_content_type(id) DEFERRABLE INITIALLY DEFERRED
"projects_toolresult_tool_id_34a3bb16_fk_projects_tool_id" FOREIGN KEY (tool_id) REFERENCES projects_tool(id) DEFERRABLE INITIALLY DEFERRED
You are filtering your data on status and tool_id and sorting on updated_at, but you have no single index covering all three of those columns.
Add an index, like so:
CREATE INDEX ON projects_toolresult (status, tool_id, updated_at);
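For what it's worth, a sketch of the same idea with the sort column declared descending to match the query's ORDER BY (a plain ascending updated_at also works via a backward scan), plus the EXPLAIN rerun to confirm the new index gets picked; the index name is illustrative:
CREATE INDEX projects_toolresult_status_tool_updated_idx
    ON projects_toolresult (status, tool_id, updated_at DESC);

-- The plan should now show an index scan on this index,
-- with no "Rows Removed by Filter" line.
EXPLAIN ANALYZE
SELECT id, tool_id, status, updated_at, created_at
FROM projects_toolresult
WHERE status = 1 AND tool_id = 21
ORDER BY updated_at DESC
LIMIT 1;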

How to tell Postgres to use an index instead of a bitmap scan?

If I add an extra column to the ORDER BY, Postgres switches to a bitmap scan and performance drops dramatically (4 seconds vs. 0.06 milliseconds). The application becomes unusable. Yet I am only asking it to order a small set of results, which are indexed, by the way.
How should I modify my query so Postgres uses the index instead of a bitmap scan? Using the index is what it should do; I have 100 million records in the table.
Slow query, Bitmap scan:
EXPLAIN ANALYZE
SELECT
valtr_id,
from_id,
to_id,
from_balance,
to_balance,
block_num
FROM value_transfer v
WHERE v.block_num<=2435013 AND v.to_id = 22479
ORDER BY block_num desc,valtr_id desc
LIMIT 1
Limit (cost=1235402.27..1235402.27 rows=1 width=32) (actual time=4665.595..4665.596 rows=1 loops=1)
-> Sort (cost=1235402.27..1238237.41 rows=1134056 width=32) (actual time=4665.594..4665.594 rows=1 loops=1)
Sort Key: block_num DESC, valtr_id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on value_transfer v (cost=21229.61..1229731.99 rows=1134056 width=32) (actual time=268.917..4170.374 rows=1102867 loops=1)
Recheck Cond: (to_id = 22479)
Rows Removed by Index Recheck: 9412580
Filter: (block_num <= 2435013)
Heap Blocks: exact=32392 lossy=132879
-> Bitmap Index Scan on vt_to_id_idx (cost=0.00..20946.10 rows=1134071 width=0) (actual time=254.870..254.870 rows=1102867 loops=1)
Index Cond: (to_id = 22479)
Planning time: 0.290 ms
Execution time: 4665.634 ms
(13 rows)
Now, if I remove just one ORDER condition, the query is orders of magnitude faster.
Without ORDER by valtr_id DESC I have the following performance:
EXPLAIN ANALYZE
SELECT
valtr_id,
from_id,
to_id,
from_balance,
to_balance,
block_num
FROM value_transfer v
WHERE v.block_num<=2435013 AND v.to_id = 22479
ORDER BY block_num desc
LIMIT 1
Limit (cost=0.57..2.46 rows=1 width=32) (actual time=0.028..0.028 rows=1 loops=1)
-> Index Scan using idx_2 on value_transfer v (cost=0.57..2148650.88 rows=1134056 width=32) (actual time=0.027..0.027 rows=1 loops=1)
Index Cond: ((to_id = 22479) AND (block_num <= 2435013))
Planning time: 0.310 ms
Execution time: 0.060 ms
(5 rows)
How do I tell Postgres to use the index first and only sort the results afterwards?
My table is defined like this:
CREATE TABLE value_transfer (
valtr_id BIGSERIAL PRIMARY KEY,
tx_id BIGINT REFERENCES transaction(tx_id) ON DELETE CASCADE ON UPDATE CASCADE,
block_id INT REFERENCES block(block_id) ON DELETE CASCADE ON UPDATE CASCADE,
block_num INT NOT NULL,
from_id INT NOT NULL,
to_id INT NOT NULL,
value NUMERIC DEFAULT 0,
from_balance NUMERIC DEFAULT 0,
to_balance NUMERIC DEFAULT 0,
kind CHAR NOT NULL,
depth INT DEFAULT 0,
error TEXT NOT NULL
);
I have created lots of different indexes during my tests:
indexname | indexdef
---------------------+-----------------------------------------------------------------------------------------
value_transfer_pkey | CREATE UNIQUE INDEX value_transfer_pkey ON public.value_transfer USING btree (valtr_id)
vt_block_id_idx | CREATE INDEX vt_block_id_idx ON public.value_transfer USING btree (block_id)
vt_block_num_idx | CREATE INDEX vt_block_num_idx ON public.value_transfer USING btree (block_num)
vt_from_id_idx | CREATE INDEX vt_from_id_idx ON public.value_transfer USING btree (from_id)
vt_to_id_idx | CREATE INDEX vt_to_id_idx ON public.value_transfer USING btree (to_id)
vt_tx_from_idx | CREATE INDEX vt_tx_from_idx ON public.value_transfer USING btree (tx_id)
idx_1 | CREATE INDEX idx_1 ON public.value_transfer USING btree (from_id, block_num DESC)
idx_2 | CREATE INDEX idx_2 ON public.value_transfer USING btree (to_id, block_num DESC)
idx_1_rev | CREATE INDEX idx_1_rev ON public.value_transfer USING btree (block_num DESC, from_id)
idx_2_rev | CREATE INDEX idx_2_rev ON public.value_transfer USING btree (block_num DESC, to_id)
valtr_ordered_idx | CREATE INDEX valtr_ordered_idx ON public.value_transfer USING btree (valtr_id)
(11 rows)
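A sketch of the kind of index that typically resolves this (the index name is illustrative): put the equality column first and the full ORDER BY columns after it, in the same direction as the query, so the first qualifying index entry is already the row that LIMIT 1 asks for and no separate sort step is needed:
-- to_id is the equality column; block_num and valtr_id follow in the same
-- order and direction as the ORDER BY, so the planner can stop after one row.
CREATE INDEX idx_2_full ON value_transfer (to_id, block_num DESC, valtr_id DESC);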

Not Sure if Postgresql Cube Gist Index is working

I'm trying to figure out if my GIST index on my cube column for my table is working for my nearest neighbors query (metric = Euclidean). My cube values are 75 dimensional vectors.
Table:
\d+ reduced_features
Table "public.reduced_features"
Column | Type | Modifiers | Storage | Stats target | Description
----------+--------+-----------+---------+--------------+-------------
id | bigint | not null | plain | |
features | cube | not null | plain | |
Indexes:
"reduced_features_id_idx" UNIQUE, btree (id)
"reduced_features_features_idx" gist (features)
Here is my query:
explain analyze select id from reduced_features order by features <-> (select features from reduced_features where id = 198990) limit 10;
Results:
QUERY PLAN
---------------
Limit (cost=8.58..18.53 rows=10 width=16) (actual time=0.821..35.987 rows=10 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using reduced_features_id_idx on reduced_features reduced_features_1 (cost=0.29..8.31 rows=1 width=608) (actual time=0.014..0.015 rows=1 loops=1)
Index Cond: (id = 198990)
-> Index Scan using reduced_features_features_idx on reduced_features (cost=0.28..36482.06 rows=36689 width=16) (actual time=0.819..35.984 rows=10 loops=1)
Order By: (features <-> $0)
Planning time: 0.117 ms
Execution time: 36.232 ms
I have 36689 total records in my table. Is my index working?
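For context, the Order By: (features <-> $0) line under the index scan is the sign that the GiST index is driving the KNN ordering. One way to see what it buys you (a sketch using standard planner settings) is to compare against a plan with index scans disabled inside a throwaway transaction:
BEGIN;
SET LOCAL enable_indexscan = off;  -- typically forces a sequential scan plus a top-N sort
EXPLAIN ANALYZE
SELECT id
FROM reduced_features
ORDER BY features <-> (SELECT features FROM reduced_features WHERE id = 198990)
LIMIT 10;
ROLLBACK;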

Storing 'ties' for Contests in Postgres

I'm trying to determine whether there is a "low cost" optimization for the following query. We've implemented a system whereby 'tickets' earn 'points' and thus can be ranked. To support analytical queries, we store the rank of every ticket, and whether the ticket is tied, along with the ticket.
I've found that, at scale, maintaining the is_tied field is very slow. I'm running the scenario below on sets of roughly 20-75k tickets.
I'm hoping that someone can help identify why and offer some help.
We're on postgres 9.3.6
Here's a simplified ticket table schema:
ogs_1=> \d api_ticket
Table "public.api_ticket"
Column | Type | Modifiers
------------------------------+--------------------------+---------------------------------------------------------
id | integer | not null default nextval('api_ticket_id_seq'::regclass)
status | character varying(3) | not null
points_earned | integer | not null
rank | integer | not null
event_id | integer | not null
user_id | integer | not null
is_tied | boolean | not null
Indexes:
"api_ticket_pkey" PRIMARY KEY, btree (id)
"api_ticket_4437cfac" btree (event_id)
"api_ticket_e8701ad4" btree (user_id)
"api_ticket_points_earned_idx" btree (points_earned)
"api_ticket_rank_idx" btree ("rank")
Foreign-key constraints:
"api_ticket_event_id_598c97289edc0e3e_fk_api_event_id" FOREIGN KEY (event_id) REFERENCES api_event(id) DEFERRABLE INITIALLY DEFERRED
(user_id) REFERENCES auth_user(id) DEFERRABLE INITIALLY DEFERRED
Here's the query that I'm executing:
UPDATE api_ticket t SET is_tied = False
WHERE t.event_id IN (SELECT id FROM api_event WHERE status = 'c');
UPDATE api_ticket t SET is_tied = True
FROM (
SELECT event_id, rank
FROM api_ticket tt
WHERE event_id in (SELECT id FROM api_event WHERE status = 'c')
AND tt.status <> 'x'
GROUP BY rank, event_id
HAVING count(*) > 1
) AS tied_tickets
WHERE t.rank = tied_tickets.rank AND
tied_tickets.event_id = t.event_id;
Here's the explain on a set of about 35k rows:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Update on api_ticket t (cost=3590.01..603570.21 rows=157 width=128)
-> Nested Loop (cost=3590.01..603570.21 rows=157 width=128)
-> Subquery Scan on tied_tickets (cost=2543.31..2556.18 rows=572 width=40)
-> HashAggregate (cost=2543.31..2550.46 rows=572 width=8)
Filter: (count(*) > 1)
-> Nested Loop (cost=0.84..2539.02 rows=572 width=8)
-> Index Scan using api_event_status_idx1 on api_event (cost=0.29..8.31 rows=1 width=4)
Index Cond: ((status)::text = 'c'::text)
-> Index Scan using api_ticket_4437cfac on api_ticket tt (cost=0.55..2524.99 rows=572 width=8)
Index Cond: (event_id = api_event.id)
Filter: ((status)::text <> 'x'::text)
-> Bitmap Heap Scan on api_ticket t (cost=1046.70..1050.71 rows=1 width=92)
Recheck Cond: (("rank" = tied_tickets."rank") AND (event_id = tied_tickets.event_id))
-> BitmapAnd (cost=1046.70..1046.70 rows=1 width=0)
-> Bitmap Index Scan on api_ticket_rank_idx (cost=0.00..26.65 rows=708 width=0)
Index Cond: ("rank" = tied_tickets."rank")
-> Bitmap Index Scan on api_ticket_4437cfac (cost=0.00..1019.79 rows=573 width=0)
Index Cond: (event_id = tied_tickets.event_id)
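A sketch of one thing worth trying here (not a confirmed fix): the inner bitmap heap scan has to combine api_ticket_rank_idx and api_ticket_4437cfac with a BitmapAnd for every tied (event_id, rank) pair, whereas a composite index lets each lookup hit a single index:
-- Covers the join condition t.event_id = tied_tickets.event_id AND t.rank = tied_tickets.rank
CREATE INDEX api_ticket_event_rank_idx ON api_ticket (event_id, "rank");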