My PostgreSQL version is 10.6. I have created an index, but it is not used for every WHERE clause condition. Details below:
Create index concurrently ticket_created_at_portal_id_created_by_id_assigned_group_id_idx on ticket(created_at, portal_id, created_by_id, assigned_group);
EXPLAIN (analyze true, verbose true, costs true, buffers true, timing true ) select * from ticket where status is not null
and (assigned_group in ('447') or created_by_id in ('39731566'))
and portal_id=8
and created_at>='2020-12-07T03:00:10.973'
and created_at<='2021-02-05T03:00:10.973'
order by updated_at DESC limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18975.23..18975.25 rows=10 width=638) (actual time=278.340..278.345 rows=10 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Buffers: shared hit=2280 read=3105
-> Sort (cost=18975.23..18975.45 rows=87 width=638) (actual time=278.338..278.339 rows=10 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Sort Key: ticket.updated_at DESC
Sort Method: top-N heapsort Memory: 33kB
Buffers: shared hit=2280 read=3105
-> Bitmap Heap Scan on public.ticket (cost=17855.76..18973.35 rows=87 width=638) (actual time=111.871..275.835 rows=1256 loops=1)
Output: id, action, assigned_agent, assigned_at, assigned_group, attachments, closed_at, created_at, created_by_email, created_by_id, description, first_response_time, parent_id, portal_id, priority, reopened_at, resolution_id, resolved_at, resolved_by_id, resource_id, resource_type, source, status, subject, tags, ticket_category, ticket_id, ticket_sub_category, ticket_sub_sub_category, type, updated_at, custom_fields, updated_by_id, first_assigned_at, mode, sla_breached, agent_assist_tags, comm_vendor, from_email
Recheck Cond: (((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8)) OR ((ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone) AND (ticket.portal_id = 8) AND (ticket.created_by_id = '39731566'::bigint)))
Filter: ((ticket.status IS NOT NULL) AND (ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone))
Rows Removed by Filter: 1517
Heap Blocks: exact=2638
Buffers: shared hit=2277 read=3105
-> BitmapOr (cost=17855.76..17855.76 rows=291 width=0) (actual time=106.215..106.216 rows=0 loops=1)
Buffers: shared hit=336 read=2408
-> Bitmap Index Scan on ticket_assigned_group_portal_id_assigned_agent_idx (cost=0.00..11.25 rows=282 width=0) (actual time=10.661..10.661 rows=2776 loops=1)
Index Cond: ((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8))
Buffers: shared hit=4 read=15
-> Bitmap Index Scan on ticket_created_at_portal_id_created_by_id_assigned_group_id_idx (cost=0.00..17844.47 rows=9 width=0) (actual time=95.551..95.551 rows=2 loops=1)
Index Cond: ((ticket.created_at >= '2020-12-07 03:00:10.973'::timestamp without time zone) AND (ticket.created_at <= '2021-02-05 03:00:10.973'::timestamp without time zone) AND (ticket.portal_id = 8) AND (ticket.created_by_id = '39731566'::bigint))
Buffers: shared hit=332 read=2393
Planning time: 43.083 ms
Execution time: 278.556 ms
(25 rows)
ticket_created_at_portal_id_created_by_id_assigned_group_id_idx contains every column from the WHERE clause except status, yet the query still uses a separate index for Index Cond: ((ticket.assigned_group = '447'::bigint) AND (ticket.portal_id = 8)), even though those columns are already present in the second index ticket_created_at_portal_id_created_by_id_assigned_group_id_idx.
Why is that? Even when I included the status column in the index, the query still used 2 indexes and then did a heavy filter during the bitmap heap scan.
How can we optimize this?
I also experimented with the table's indexes but was still unable to drop any of them. The same columns are repeated across multiple indexes for different queries; please help if we can reduce the number of indexes.
All indexes for table are:
"ticket_pkey" PRIMARY KEY, btree (id)
"ticket_ticket_id_idx" UNIQUE, btree (ticket_id)
"uk2uors84i0m8sjxc6oaocuy6oj" UNIQUE CONSTRAINT, btree (ticket_id)
"idx_resource_id" btree (resource_id)
"idx_ticket_created_at" btree (created_at)
"ticket_assigned_agent_idx" btree (assigned_agent)
"ticket_assigned_group_idx" btree (assigned_group)
"ticket_assigned_group_portal_id_assigned_agent_idx" btree (assigned_group, portal_id, assigned_agent)
"ticket_created_at_portal_id_created_by_id_assigned_group_id_idx" btree (created_at, portal_id, created_by_id, assigned_group)
"ticket_created_at_portal_id_status_idx" btree (created_at, portal_id, status)
"ticket_id_resolved_at_assigned_group_status_idx" btree (id, resolved_at, assigned_group, status)
How easy would it be to use a phone book (sorted by last name then first name) to find everyone with a first name of "Francis" whose last name starts with a letter between K and T? Not very easy, because it is not sorted by first name. You would have to go through the entire middle half of the phone book, reading everyone's first name.
Same here. When the first column in your index is used in a range/inequality condition rather than equality, it makes all the columns after it much less useful. You would want to put first the columns that are tested for equality and are not inside an OR. Unfortunately, that is only portal_id. The best thing to put next depends on how selective each of the other conditions is, which we can't guess from the info provided.
In deciding this, status IS NULL would count as equality, but status IS NOT NULL does not, since any number of values satisfy it while still being not null; it is effectively an inequality. If this condition is highly selective, the best way to incorporate it would be in the WHERE of a partial index.
Because of the OR, you might still be best off with 2 indexes which could be combined in a bitmap or.
...(portal_id, assigned_group, created_at) WHERE status IS NOT NULL;
...(portal_id, created_by_id, created_at) WHERE status IS NOT NULL;
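Spelled out, those two partial indexes might be created roughly like this (the index names are my own invention; CONCURRENTLY is optional but matches how the original index was built):
CREATE INDEX CONCURRENTLY ticket_portal_group_created_partial_idx
    ON ticket (portal_id, assigned_group, created_at)
    WHERE status IS NOT NULL;
CREATE INDEX CONCURRENTLY ticket_portal_creator_created_partial_idx
    ON ticket (portal_id, created_by_id, created_at)
    WHERE status IS NOT NULL;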
Another approach would be to avoid fetching and sorting all of the matching rows, by walking rows in order by updated_at using an index and stopping after 10 of them are found. An index can be used for walking a column in order as long as only things tested for equality (and without ORs) occur before the ORDER BY column, so:
...(portal_id, updated_at) WHERE status IS NOT NULL;
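For example (again, the index name is my own):
CREATE INDEX CONCURRENTLY ticket_portal_updated_partial_idx
    ON ticket (portal_id, updated_at)
    WHERE status IS NOT NULL;
With such an index the planner should be able to scan backwards in updated_at order for portal_id = 8 and stop as soon as 10 rows have passed the remaining filters, instead of sorting all matching rows.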
Approximately every 10 minutes I insert ~50 records with the same timestamp.
That means ~300 records per hour, ~7,200 per day, or ~2,592,000 per year.
User wants to retrieve all records for the timestamp closest to the asked time.
Design #1 - one table with index on timestamp column:
CREATE TABLE A (t timestamp, value int);
CREATE INDEX a_idx ON A (t);
Single insert statement creates ~50 records with the same timestamp:
INSERT INTO A VALUES
('2019-01-02 10:00', 5),
('2019-01-02 10:00', 12),
('2019-01-02 10:00', 7),
...;
Get all records which are closest to the asked time
(I use the function greatest() available in PostgreSQL):
SELECT * FROM A WHERE t =
(SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
I think this query is not efficient because it requires a full table scan.
I plan to partition the A table by timestamp to have 1 partition per year, but the approximate match above will still be slow.
Design #2 - create 2 tables:
1st table: to keep the unique timestamps and auto-incremented PK,
2nd table: to keep data and the foreign key on 1st table PK
CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
CREATE INDEX data_time_idx ON DATA (id);
Get all records which are closest to the asked time:
SELECT * FROM DATA WHERE id =
(SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)
It should run faster compared to Design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of just one
- I lose the ability to partition the DATA table by timestamp
What would you recommend?
I'd go with the single-table approach, perhaps partitioned by year so that it becomes easy to get rid of old data.
Create an index like
CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));
Then use your query like you wrote it, but add
AND date_trunc('hour', t + INTERVAL '30 minutes')
= date_trunc('hour', asked_time + INTERVAL '30 minutes')
The additional condition acts as a filter and can use the index.
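For example, with an asked_time of 2019-03-01 17:00:00, the combined query might look roughly like this (the literal timestamp is just a placeholder, and the extra condition is added to both the outer query and the subselect):
SELECT *
FROM a
WHERE date_trunc('hour', t + INTERVAL '30 minutes')
      = date_trunc('hour', TIMESTAMP '2019-03-01 17:00:00' + INTERVAL '30 minutes')
  AND t = (SELECT t
           FROM a
           WHERE date_trunc('hour', t + INTERVAL '30 minutes')
                 = date_trunc('hour', TIMESTAMP '2019-03-01 17:00:00' + INTERVAL '30 minutes')
           ORDER BY greatest(t - TIMESTAMP '2019-03-01 17:00:00',
                             TIMESTAMP '2019-03-01 17:00:00' - t)
           LIMIT 1);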
You can use a UNION of two queries to find all timestamps closest to a given one:
(
select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1
)
union all
(
select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1
)
That will efficiently make use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:
Append (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
Buffers: shared hit=6 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
Output: a.t
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Index Only Scan using a_t_idx on stuff.a (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
Output: a.t
Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=1 read=4
I/O Timings: read=0.050
-> Limit (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
Output: a_1.t
Buffers: shared hit=5
-> Index Only Scan Backward using a_t_idx on stuff.a a_1 (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
Output: a_1.t
Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
Heap Fetches: 0
Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms
As you can see it only requires very few I/O operations and that is pretty much independent of the table size.
The above can be used for an IN condition:
select *
from a
where t in (
(select t
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 1)
union all
(select t
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 1)
);
If you know you will never have more than 100 values close to that requested timestamp, you could remove the IN query completely and simply use a limit 100 in both parts of the union. That makes the query a bit more efficient as there is no second step for evaluating the IN condition, but might return more rows than you want.
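A sketch of that variant, assuming at most 100 rows per side is enough:
(
select *
from a
where t >= timestamp '2019-03-01 17:00:00'
order by t
limit 100
)
union all
(
select *
from a
where t <= timestamp '2019-03-01 17:00:00'
order by t desc
limit 100
);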
If you always look for timestamps in the same year, then partitioning by year will indeed help with this.
You can put that into a function if it is too complicated as a query:
create or replace function get_closest(p_tocheck timestamp)
returns timestamp
as
$$
select *
from (
(select t
from a
where t >= p_tocheck
order by t
limit 1)
union all
(select t
from a
where t <= p_tocheck
order by t desc
limit 1)
) x
order by greatest(t - p_tocheck, p_tocheck - t)
limit 1;
$$
language sql stable;
The query then gets as simple as:
select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
Another solution is to use the btree_gist extension, which provides a "distance" operator <->.
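If the extension is not installed in the database yet, it can be enabled once with:
create extension if not exists btree_gist;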
Then you can create a GiST index on the timestamp:
create index on a using gist (t);
and use the following query:
select *
from a where t in (select t
from a
order by t <-> timestamp '2019-03-01 17:00:00'
limit 1);
Say I have the following tables and indices:
create table inbound_messages(id int, user_id int, received_at timestamp);
create table outbound_messages(id int, user_id int, sent_at timestamp);
create index on inbound_messages(user_id, received_at);
create index on outbound_messages(user_id, sent_at);
Now I want to pull out the last 20 messages for a user, either inbound or outbound, in a specific time range. I can do the following, and from the explain it looks like PG walks both indexes backwards in 'parallel', so it minimises the number of rows it needs to scan.
explain select * from (select id, user_id, received_at as time from inbound_messages union all select id, user_id, sent_at as time from outbound_messages) x where user_id = 5 and time between '2018-01-01' and '2020-01-01' order by user_id,time desc limit 20;
Limit (cost=0.32..16.37 rows=2 width=16)
-> Merge Append (cost=0.32..16.37 rows=2 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (received_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (received_at <= '2020-01-01 00:00:00'::timestamp without time zone))
-> Index Scan Backward using outbound_messages_user_id_sent_at_idx on outbound_messages (cost=0.15..8.17 rows=1 width=16)
Index Cond: ((user_id = 5) AND (sent_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (sent_at <= '2020-01-01 00:00:00'::timestamp without time zone))
It could instead have done something crazy like finding all the matching rows in memory and then sorting them; if there were millions of matching rows, that could take a long time. But because it walks the indexes in the same order we want the results in, this is a fast operation. It looks like the 'Merge Append' operation is done lazily and doesn't actually materialize all the matching rows.
Now we can see Postgres supports this operation for two distinct tables, but is it possible to force Postgres to use this optimisation for a single table?
Let's say I wanted the last 20 inbound messages for user_id = 6 or user_id = 7.
explain select * from inbound_messages where user_id in (6,7) order by received_at desc limit 20;
Then we get a query plan that does a bitmap heap scan and then an in-memory sort. So if there are millions of matching messages, it will look at millions of rows, even though theoretically it could use the same merge trick to look at only a few rows.
Limit (cost=15.04..15.09 rows=18 width=16)
-> Sort (cost=15.04..15.09 rows=18 width=16)
Sort Key: received_at DESC
-> Bitmap Heap Scan on inbound_messages (cost=4.44..14.67 rows=18 width=16)
Recheck Cond: (user_id = ANY ('{6,7}'::integer[]))
-> Bitmap Index Scan on inbound_messages_user_id_received_at_idx (cost=0.00..4.44 rows=18 width=0)
Index Cond: (user_id = ANY ('{6,7}'::integer[]))
We could think of just adding (received_at) as an index on the table and then it will do the same backwards scan. However, if we have a large number of users then we are missing out on a potentially large speedup because we are scanning lots of index entries that would not match the query.
The following approach should work as a way of forcing Postgres to use the "merge append" plan when you are interested in most recent messages for two users from the same table.
[Note: I tested this on YugabyteDB (which is based on Postgres)- so I expect the same to apply to Postgres also.]
explain select * from (
(select * from inbound_messages where user_id = 6 order by received_at DESC)
union all
(select * from inbound_messages where user_id = 7 order by received_at DESC)
) AS result order by received_at DESC limit 20;
which produces:
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.01..3.88 rows=20 width=16)
-> Merge Append (cost=0.01..38.71 rows=200 width=16)
Sort Key: inbound_messages.received_at DESC
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 6)
-> Index Scan Backward using inbound_messages_user_id_received_at_idx on inbound_messages inbound_messages_1 (cost=0.00..17.35 rows=100 width=16)
Index Cond: (user_id = 7)
I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created:
Index of state
Index of state, acquire_time, creation_time
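Presumably defined roughly like this (the index names are my assumption; only the column lists come from the description above):
create index email_task_state_idx on email.email_task (state);
create index email_task_state_acquire_creation_idx on email.email_task (state, acquire_time, creation_time);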
Ideally I think Postgres should pick the 2nd one since it matches every column required in this SQL:
However, the execution plan shows otherwise; it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned is around 10% of the total, the planner would pick a Seq Scan over an Index Scan (as explained at Why does PostgreSQL perform sequential scan on indexed column?). So that's why index1 is not picked.
What I don't understand is why index2 is not picked, since it matches all the columns.
Then I tried a 3rd index:
Index of creation_time, acquire_time, state
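Roughly (the definition is reconstructed from the description; the name is taken from the plan below):
create index perf_1 on email_task (creation_time, acquire_time, state);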
And this time it uses index3 (I added the index using another, smaller database, perf_1, because the original one has 2 million rows and it takes too much time):
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that the Postgres planner considers the ORDER BY clause before the WHERE clause, which is a little counter-intuitive.
Is my understanding correct, or are there other factors that influence the planner here?
Thanks in advance.
I have a master/detail table situation. For every entry in the master table I have dozens in the detail table.
Let's say these are my tables:
+-----------------------
| Master
+-----------------------
| master_key integer,
| insert_date timestamp
+-----------------------
+-----------------------
| Detail
+-----------------------
| detail_key integer,
| master_key integer
| quantity numeric
| amount numeric
+-----------------------
And my most-used query is something like:
SELECT extract(year from insert_date) AS Insert_Year, extract(month from insert_date) AS Insert_Month, sum(quantity) AS Quantity, sum(amount) AS Amount
FROM Master, Detail
WHERE (amount is not null) and (insert_date <= '2016-12-31') and (insert_date >= '2015-01-01') and (Detail.master_key=Master.master_key)
GROUP BY Insert_Year, Insert_Month
ORDER BY Insert_Year ASC, Insert_Month ASC;
This query has become too slow because there are tons of data spanning many years in both tables.
Of course I have indexes on both tables, and EXPLAIN ANALYZE tells me that the index scan takes more than 80% of the whole execution time.
"Sort (cost=44013.52..44013.53 rows=1 width=19) (actual time=17073.129..17073.129 rows=16 loops=1)"
" Sort Key: (date_part('year'::text, master.insert_date)), (date_part('month'::text, master.insert_date))"
" Sort Method: quicksort Memory: 26kB"
" -> HashAggregate (cost=44013.49..44013.51 rows=1 width=19) (actual time=17073.046..17073.053 rows=16 loops=1)"
" Group Key: date_part('year'::text, master.insert_date), date_part('month'::text, master.insert_date)"
" -> Nested Loop (cost=0.43..43860.32 rows=15317 width=19) (actual time=0.056..15951.178 rows=843647 loops=1)"
" -> Seq Scan on master (cost=0.00..18881.38 rows=3127 width=12) (actual time=0.027..636.202 rows=182338 loops=1)"
" Filter: ((date(insert_date) >= '2015-01-01'::date) AND (date(insert_date) <= '2016-12-31'::date))"
" Rows Removed by Filter: 443031"
" -> Index Scan using idx_detail_master_key on detail (cost=0.43..7.89 rows=7 width=15) (actual time=0.055..0.077 rows=5 loops=182338)"
" Index Cond: (master_key = master.master_key)"
" Filter: (amount IS NOT NULL)"
" Rows Removed by Filter: 2"
"Planning time: 105.317 ms"
"Execution time: 17073.396 ms"
So my idea was to reduce the index sizes by defining them as partial indexes. In most cases only data from the last 2 years is queried.
So I tried something like that:
CREATE INDEX idx_detail_table_master_keys
ON detail (master_key)
WHERE master_key in (SELECT master_key FROM master WHERE (extract( year from insert_date) = 2016) or (extract( year from insert_date) = 2015))
Of course this is not the final version; it was just meant as a proof of concept, and it failed. pgAdmin tells me that I'm not allowed to use subselects in index creation.
So my question is: is it possible to create a partial index based on the data of another table?
And of course I would be thankful for any tips on speeding up situations like this.
regards
It is not possible to create a partial index based on the data of another table, because relational databases like PostgreSQL have three ways of joining: nested loops, hash join and sort-merge. All of these methods load the tables of a join separately. Since the database optimizer decides which of those methods will be used, and also in which direction the join will be executed, it doesn't make sense to create a table index that covers data of another table. This is why you can't define such indexes. A more detailed description of these topics can be found here: http://use-the-index-luke.com/sql/join (and the following sections of the online book).
For further optimization, see the comment from Gabriel's Messanger.