Multiple indexes available for a WHERE clause, which one does Postgres choose? - postgresql

I have this table:
CREATE TABLE IF NOT EXISTS employeedb.employee_tbl
(
id integer NOT NULL GENERATED ALWAYS AS IDENTITY ( INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 2147483647 CACHE 1 ),
dm_company_id integer NOT NULL,
customer_employee_id text NOT NULL,
personal_mobile text,
CONSTRAINT employee_pkey PRIMARY KEY (id),
CONSTRAINT employee_id_company_constraint UNIQUE (dm_company_id, customer_employee_id)
);
and an index on personal_mobile like this:
CREATE INDEX company_id_personal_mobile_index ON employeedb.employee_tbl (dm_company_id, personal_mobile COLLATE "C");
Now when I run the following query
explain analyse select * from employeedb.employee_tbl where dm_company_id = 2011 and employee_tbl.personal_mobile like '+1%';
I always see that the employee_id_company_constraint index is used:
Index Scan using employee_id_company_constraint on employee_tbl (cost=0.14..8.16 rows=1 width=641) (actual time=0.056..0.093 rows=2 loops=1)
Index Cond: (dm_company_id = 2011)
Filter: (personal_mobile ~~ '+1%'::text)
Rows Removed by Filter: 28
Planning Time: 0.951 ms
Execution Time: 0.173 ms
Why won't Postgres use company_id_personal_mobile_index, which is more efficient than employee_id_company_constraint, for the above query?

There is an important piece of information we can't see from your plan, which is how many rows it thinks will be removed by the filter. That vital data is just not reported.
You can run this query to probe what the planner thinks is going on:
explain analyse select * from employeedb.employee_tbl where dm_company_id = 2011;
We would need to know how many rows it expects to find from this simplified query, and how many it actually finds (but presumably it will actually find 30, which is the 2 found + the 28 filtered from your current plan).
If it thinks that dm_company_id = 2011 will only match 1 row, then it doesn't make sense (according to the planner's current way of thinking) to use the more complicated index to rule that single row in or out; it should be faster to just test that one row and see.
If my theory here is correct, then the question becomes: why does it expect to find 1 row when the real answer is 30? Maybe your stats are out of date, maybe something else is going on. An ANALYZE should fix the first; the second will require more investigation, which isn't warranted until my theory is confirmed and the ANALYZE fails to correct the problem.
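As a minimal sketch of how you could refresh and then inspect the statistics the planner is working from (pg_stats is the standard statistics view; nothing here is specific to your setup beyond the table and column names):
ANALYZE employeedb.employee_tbl;
-- what the planner believes about dm_company_id's distribution
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'employeedb'
  AND tablename = 'employee_tbl'
  AND attname = 'dm_company_id';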


Why does postgres use index scan over sequential scan even with a mismatching data type on the indexed column and query condition

I have the following PostgreSQL table:
CREATE TABLE staff (
id integer primary key,
full_name VARCHAR(100) NOT NULL,
department VARCHAR(100) NULL,
tier bigint
);
I filled the table with random data using the following block:
do $$
declare
begin
FOR counter IN 1 .. 100000 LOOP
INSERT INTO staff (id, full_name, department, tier)
VALUES (nextval('staff_sequence'),
random_string(10),
get_department(),
floor(random() * 5 + 1)::bigint);
end LOOP;
end; $$;
After the data is populated, I created an index on this table on the tier column:
create index staff_tier_idx on staff(tier);
Although I created this index, I want it NOT to be used when I query on this column. To accomplish this, I tried executing this query:
select count(*) from staff where tier=1::numeric;
Due to the mismatching data types between the indexed column and the query condition, I thought the index would not be used and a sequential scan would be executed instead. However, when I run EXPLAIN ANALYZE on the above query I get the following output:
Aggregate (cost=2349.54..2349.55 rows=1 width=8) (actual time=17.078..17.079 rows=1 loops=1)
-> Index Only Scan using staff_tier_idx on staff (cost=0.29..2348.29 rows=500 width=0) (actual time=0.022..15.925 rows=19942 loops=1)
Filter: ((tier)::numeric = '1'::numeric)
Rows Removed by Filter: 80058
Heap Fetches: 0
Planning Time: 0.305 ms
Execution Time: 17.130 ms
Showing that the index has indeed been used.
How do I change this so that the query uses a sequential scan instead of the index? This is purely for testing/learning purposes.
If it's of any importance, I am running this on an Amazon RDS database instance.
From the "Filter" rows of the plan like
Rows Removed by Filter: 80058
you can see that the index is not being used as a real index, but just as a skinny table, testing the casted condition for each row. This appears favorable because the index is less than 1/4 the size of the table, while the default ratio of random_page_cost/seq_page_cost = 4.
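If you want to check that size ratio yourself, one way (using only the standard size functions) is:
select pg_size_pretty(pg_relation_size('staff')) as table_size,
       pg_size_pretty(pg_relation_size('staff_tier_idx')) as index_size;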
In addition to just outright disabling index scans as Adrian already suggested, you could also discourage this "skinny table" usage by just increasing random_page_cost, since pages of indexes are assumed to be read in random order.
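As a session-local sketch of both approaches (enable_indexscan, enable_indexonlyscan, and random_page_cost are standard planner settings; the value 20 is just for experimentation):
-- discourage index scans outright for this session
set enable_indexonlyscan = off;
set enable_indexscan = off;
-- or make index pages look expensive relative to sequential reads
set random_page_cost = 20;
explain analyze select count(*) from staff where tier = 1::numeric;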
Another method would be to change the query so it can't use the index-only scan. For example, just using count(full_name) would do that, as PostgreSQL then needs to visit the table to make sure full_name is not NULL (even though it has a constraint asserting that already--sometimes it is not very clever)
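For instance, this variant of the query should no longer qualify for an index-only scan:
select count(full_name) from staff where tier = 1::numeric;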
Which method is better depends on what it is you are wanting to test/learn.

Postgres TIMESTAMP index and query performance

I have this table:
CREATE TABLE IF NOT EXISTS CHANGE_REQUESTS (
ID UUID PRIMARY KEY,
FIELD_ID INTEGER NOT NULL,
LAST_CHANGE_DATE TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And I'm always going to be running the exact same query on it:
select * from change_requests where last_change_date > now() - INTERVAL '10 min';
The size of the table is going to be anywhere from 750k to 1million rows on average.
My question is how can I make sure the query is always very fast? I'm thinking of adding an index on last_change_date, but I'm not sure if that will do anything. I tried it (with only 1 row in the table right now) and got this explain:
create index change_requests__dt_index
on change_requests (last_change_date);
Seq Scan on change_requests (cost=0.00..1.02 rows=1 width=28)
Filter: (last_change_date > (now() - '00:10:00'::interval))
So it doesn't appear to use the index at all.
Will this index actually help? If not, what else could I do? Thanks!
Your index is perfect for the task. You see the sequential scan in the execution plan because you don't have a realistic amount of test data in the table, and for very small tables the overhead of using the index is not worth the effort (you'd have to process more 8kB database blocks).
Always test with realistic amounts of data. That will save you some pain later on.
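A rough sketch of how you might load a realistic volume of test data and then re-check the plan (gen_random_uuid() is built in from PostgreSQL 13 on, or available via the pgcrypto extension on older versions; the value ranges here are just assumptions):
INSERT INTO change_requests (id, field_id, last_change_date)
SELECT gen_random_uuid(),
       (random() * 1000)::int,
       now() - random() * interval '30 days'
FROM generate_series(1, 1000000);
ANALYZE change_requests;
explain select * from change_requests where last_change_date > now() - INTERVAL '10 min';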

Postgres SLOWER when a LIMIT is set: how to fix besides adding a dummy `ORDER BY`?

In Postgres, some queries are a whole lot slower when adding a LIMIT:
The queries:
SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4; -- 51 sec
SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4; -- 0.020s
SELECT * FROM review WHERE clicker_id=28 LIMIT 4; -- 0.007s
SELECT * FROM review WHERE clicker_id=28 ORDER BY id; -- 0.007s
As you can see, I need to add a dummy id to the ORDER BY in order for things to go fast. I'm trying to understand why.
Running EXPLAIN on them:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id;
gives this:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4
Limit (cost=0.44..249.76 rows=4 width=56)
-> Index Scan using review_done on review (cost=0.44..913081.13 rows=14649 width=56)
Filter: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4
Limit (cost=11970.75..11970.76 rows=4 width=56)
-> Sort (cost=11970.75..12007.37 rows=14649 width=56)
Sort Key: id, done DESC
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4
Limit (cost=0.44..3.65 rows=4 width=56)
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id
Sort (cost=12764.61..12801.24 rows=14649 width=56)
Sort Key: id
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
I'm no SQL expert, but I take it Postgres expected the query to be faster than it actually is, and so used a way to fetch the data that's actually inappropriate, correct?
The database:
The review table:
Contains 22+ million rows.
A given user will get 7,066 rows tops.
The one in the test (id 28) has 288 at the time.
Has this structure:
id: bigint Auto Increment [nextval('public.review_id_seq')]
type: review_type NULL
iteration: smallint NULL
repetitions: smallint NULL
due: timestamptz NULL
done: timestamptz NULL
added: timestamptz NULL
clicker_id: bigint NULL
monologue_id: bigint NULL
Has these indexes:
UNIQUE type, clicker_id, monologue_id, iteration
INDEX clicker_id
INDEX done, due, monologue_id
INDEX id
INDEX done DESC
INDEX type
Additional details:
Environment:
The queries were run in development with Postgres 9.6.14.
Running the queries in production (Heroku Postgres, version 9.6.16), the difference is less dramatic, but still not great: the slow queries might take 600 ms.
Variable speed:
Sometimes, the same queries (be it the exact same, or for a different clicker_id) run a lot faster (under 1 sec), but I don't understand why. I need them to be consistently fast.
If I use LIMIT 288 for a user that has 288 rows, then it's so much faster (< 1sec), but if I do the same for a user with say 7066 rows then it's back to super slow.
Before I figured the use of a dummy ORDER BY, I tried these:
Re-importing the database.
analyze review;
Setting the index for done to DESC (used to be set to default/ASC.) [The challenge then was that there's no proper way to check if/when the index is done rebuilding.]
None helped.
The question:
My issue in itself is solved, but I'm dissatisfied with it:
Is there a name for this "pattern" that consists of adding a dummy ORDER BY to speed things up?
How can I spot such issues in the future? (This took ages to figure out.) Unless I missed something, the EXPLAIN output is not that useful:
For the slow query, the cost is misleadingly low, while for the fast variant it's misleadingly high.
Alternative: is there another way to handle this? (Because this solution feels like a hack.)
Thanks!
Similar questions:
PostgreSQL query very slow with limit 1 is almost the same question, except his queries were slow with LIMIT 1 and fine with LIMIT 3 or higher. And then of course the data is not the same.
The underlying issue here is what's called an abort-early query plan. Here's a thread from pgsql-hackers describing something similar:
https://www.postgresql.org/message-id/541A2335.3060100%40agliodbs.com
Quoting from there, this is why the planner is using the often-extremely-slow index scan when the ORDER BY done DESC is present:
As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
chapters.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio). This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
In your case, the planner thinks that if it just starts going through rows ordered by done desc (IOW, using the review_done index), it will find 4 rows with clicker_id=28 quickly. Since the rows need to be returned in "done" descending order, it thinks this will save a sort step and be faster than retrieving all rows for clicker 28 and then sorting. Given the real-world distribution of rows, this can often turn out not to be the case, requiring it to skip a huge number of rows before finding 4 with clicker=28.
A more-general way of handling it is to use a CTE (which, in 9.6, is still an optimization fence - this changes in PG 12, FYI) to fetch the rows without an order by, then add the ORDER BY in the outer query. Given that fetching all rows for a user, sorting them, and returning however many you need is completely reasonable for your dataset (even the 7k-rows clicker shouldn't be an issue), you can prevent the planner from believing an abort-early plan will be fastest by not having an ORDER BY or a LIMIT in the CTE, giving you a query of something like:
WITH clicker_rows as (SELECT * FROM review WHERE clicker_id=28)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;
This should be reliably fast while still respecting the ORDER BY and the LIMIT you want. I'm not sure if there's a name for this pattern, though.
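Note that in PostgreSQL 12 and later, where CTEs are inlined by default and no longer act as an optimization fence, you would have to mark the CTE as MATERIALIZED to keep this behaviour:
WITH clicker_rows AS MATERIALIZED (SELECT * FROM review WHERE clicker_id=28)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;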

How can I optimize this join on timestamps in PostgreSQL

PostgreSQL version 10
Windows 10
16GB RAM
SSD
I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the id's of little_table rows where big_table's timestamp overlaps with two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?
Thanks!
little_table with 12,000 rows
CREATE TABLE public.little_table
(
id bigint,
start_time timestamp without time zone,
stop_time timestamp without time zone
);
CREATE INDEX idx_little_table
ON public.little_table USING btree
(start_time, stop_time DESC);
big_table with 45 million rows
CREATE TABLE public.big_table
(
id bigint,
datetime timestamp without time zone
) ;
CREATE INDEX idx_big_table
ON public.big_table USING btree
(datetime);
Query
explain analyze
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime between lt.start_time and lt.stop_time)
Explain Results
Nested Loop Left Join (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
-> Seq Scan on big_table bt (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
-> Index Scan using idx_little_table on little_table lt (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms
NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.
This query cannot perform any faster.
Since there is no equality operator (=) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.
I would suggest trying to change the start_time and stop_time columns in the little table to a single tsrange column. According to the docs, this datatype supports a GiST index which can speed up the "range contains element" operator @>. Maybe this will do better than the index scan on your current btree.
Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?
To explain a bit about your current plan:
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index, and stops once start_time > bt.datetime, using bt.datetime <= stop_time as an in-index filter that needs to examine each row before rejecting it.
To flesh out Bergi's answer, you could do this:
alter table little_table add range tsrange;
update little_table set range = tsrange(start_time, stop_time, '[]');
create index on little_table using gist(range);
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime <@ lt.range)
In my hands, that is about 4 times faster than your current method.
If your join did not need to be a left join, then you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into two operations, an inner join and then a probe for missing values, and combining the results, as sketched below.
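A rough sketch of that two-step idea, assuming the tsrange column added above (an inner join driven from the small table, plus a NOT EXISTS probe that picks up the big_table rows with no match):
select bt.id as bt_id, lt.id as lt_id
from little_table lt
join big_table bt on bt.datetime <@ lt.range
union all
select bt.id, null::bigint
from big_table bt
where not exists (
    select 1 from little_table lt where bt.datetime <@ lt.range
);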

Selecting primary key: why Postgres prefers to do a sequential scan vs an index scan

I have the following table
create table log
(
id bigint default nextval('log_id_seq'::regclass) not null
constraint log_pkey
primary key,
level integer,
category varchar(255),
log_time timestamp,
prefix text,
message text
);
It contains about 3 million rows.
I'm comparing the following queries:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000
which yields the following plan:
Limit (cost=0.00..19498.87 rows=100000 width=8)
-> Seq Scan on log (cost=0.00..422740.48 rows=2168025 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
And the same query with ORDER BY id instruction added:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000
which yields
Limit (cost=0.43..25694.15 rows=100000 width=8)
-> Index Scan using log_pkey on log (cost=0.43..557048.28 rows=2168031 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
I have the following questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they may just as well be delivered sorted. Why does it not use the index without ORDER BY?
How can Postgres use the index at all in such a query? The WHERE clause contains a non-indexed column, and fetching that column should require a sequential scan of the table, yet the plan for the query with ORDER BY doesn't show one.
The Postgres manual page says:
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster (it is sequential access, at least in terms of page scanning) than reading unordered data and then ordering it manually.
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster (it is sequential access, at least in terms of page scanning) than reading unordered data and then ordering it manually.
The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.
Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.
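For example, to view the stock values and experiment within a single session (1.0 and 4.0 are the shipped defaults; the SET below lasts only for your session, and the 1.1 value is just an illustration often used for fast SSD storage):
SHOW seq_page_cost;    -- 1 by default
SHOW random_page_cost; -- 4 by default
SET random_page_cost = 1.1; -- session-local experiment only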
Let's go back to your earlier questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they may just as well be delivered sorted. Why does it not use the index without ORDER BY?
The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".
The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".
You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.
Hope that helps, and let me know if you have further questions.