PostgreSQL bitmap heap scan takes a long time

I have a Case table which has almost 2,500,000 rows and 176 columns (mostly varchar):
SELECT count(ca.id)
FROM   salesforce.case ca
WHERE  ca.accountid = '001i000000E'
AND    ca.createddate BETWEEN current_date - interval '6 months' AND current_date;
I am trying to get the number of records created in the last 6 months for a specific account. But when I look at the EXPLAIN ANALYZE output, I see that it first index-scans all the records created in that timespan and all the records for that account, and then does a bitmap heap scan (which takes a long time).
https://explain.depesz.com/s/8Lje
Is there any way we can make it faster?

You can create an index on (accountid, createddate) that can be used for both conditions.
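For example (a minimal sketch; the index name is only illustrative):
CREATE INDEX case_accountid_createddate_idx
ON salesforce.case (accountid, createddate);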
Given the numbers in your explain output, it may be that PostgreSQL still uses a bitmap index scan because it thinks that is faster. You can try lowering random_page_cost, see if a different plan is chosen, and test which one is actually the fastest.
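For example, you can lower it for your session only and rerun the query (1.1 is just a commonly tried value for SSD-backed storage; the default is 4):
SET random_page_cost = 1.1;
-- rerun EXPLAIN (ANALYZE, BUFFERS) on the query above and compare the plan and timing
RESET random_page_cost;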

Related

PSQL - Performance of query filtered by intervals on two separate fields

I have a PostgreSQL table that stores time intervals.
This is a simplified structure of my table:
CREATE TABLE intervals (
name varchar(40),
time_from timestamp,
time_to timestamp
);
The table contains millions of records, but if you filter on a specific point of time in the past, the number of records for which
time_from <= [requested time] <= time_to
is always very limited (no more than 3k results). So a query like this one
SELECT *
FROM intervals
WHERE time_from <= '2020-01-01T10:00:00' and time_to >= '2020-01-01T10:00:00'
is supposed to return a relatively small number of results, and in theory it should be quite fast if I used the correct index. But it's not fast at all.
I tried adding a combined index on time_from and time_to, but the engine doesn't pick it.
Seq Scan on intervals (cost=0.00..156152.46 rows=428312 width=32) (actual time=13.223..3599.840 rows=4981 loops=1)
Filter: ((time_from <= '2020-01-01T10:00:00') AND (time_to >= '2020-01-01T10:00:00'))
Rows Removed by Filter: 2089650
Planning Time: 0.159 ms
Execution Time: 3600.618 ms
What type of index should I add, in order to speed up this query?
A btree index cannot be very efficient here. It can quickly throw out everything whose time_from > '2020-01-01T10:00:00', but that is probably not all that much of the table (at least, not if your table goes back for many years). Once the first column of the index has been consumed in this way, the next column cannot be used very efficiently. It can only jump to a specific part of time_to values within the ties of time_from, and that is just not very useful as there are probably not all that many ties. (At least, not that it can prove to itself while planning your query).
What you need is a gist index, which specializes in this kind of multi-dimensional thing:
create extension btree_gist;
create index on intervals using gist (time_from,time_to);
This index will support your query as written. Another possibility is to construct time ranges and index those, rather than the separate begin and end points.
-- this one does not need btree_gist.
create index on intervals using gist (tsrange(time_from,time_to));
But this index forces you to write the query differently:
SELECT * FROM intervals
WHERE tsrange(time_from,time_to) @> '2020-01-01T10:00:00'::timestamp
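To verify that the expression index is picked up, you can check the plan of the rewritten query (just a sanity check under the index definition above):
EXPLAIN
SELECT *
FROM   intervals
WHERE  tsrange(time_from, time_to) @> '2020-01-01T10:00:00'::timestamp;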

How do I improve date-based query performance on a large table?

This is related to 2 other questions I posted (sounds like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only 2 months' worth remained in the table that I'm querying.
Indexing strategy for different combinations of WHERE clauses incl. text patterns
How to get date_part query to hit index?
Giving further detail this time - hopefully it will help pinpoint the issue:
PG version 10.7 (running on Heroku)
Total DB size: 18.4GB (this contains 2 months' worth of data, and it will grow at approximately the same rate each month)
15GB RAM
Total available storage: 512GB
The largest table (the one that the slowest query is acting on) is 9.6GB (it's the largest chunk of the total DB) - about 10 million records
Schema of the largest table:
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;
Slow query:
SELECT
date_part('hour', datelocal) AS hour,
SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
datelocal >= '3-1-2019' AND
datelocal < '4-1-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)
The date range in this query will generally be for an entire month (it accepts user input from a web based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has recently been run (putting the results into the cache), it can still take up to a minute to run.
Explain analyze results:
Finalize GroupAggregate (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Index Scan using reportimpression_mar2019_index on reportimpression (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms
I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan I'm on to throw more resources at it, so I assume the issue is just that either my indexes or my queries are not very efficient.
A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing it. You may want to special-case the current month if you need to cover that, too.
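A minimal sketch of such a pre-aggregated materialized view, matching the slow query above (the view name and exact shape are illustrative, not from the original answer):
CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('month', datelocal)              AS month,
       date_part('hour', datelocal)                AS hour,
       SUM(views) FILTER (WHERE gender = 'male')   AS male,
       SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
GROUP  BY 1, 2;

-- refresh after loading new data:
REFRESH MATERIALIZED VIEW reportimpression_hourly;

The monthly report then reads from the small view instead of the big table:
SELECT hour, male, female
FROM   reportimpression_hourly
WHERE  month = '2019-03-01'
ORDER  BY hour;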
The underlying query can still benefit from an index, and there are two directions you might take:
First off, partial indexes like you have now won't buy you much in your scenario; they are not worth it. If you collect many more months of data and mostly query by month (and add / drop rows by month), table partitioning might be an idea; then your indexes are partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.
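A rough sketch of what monthly range partitioning could look like (Postgres 11+; names and the abbreviated column list are illustrative):
CREATE TABLE reportimpression_part (
    datelocal timestamp,
    gender    text,
    views     integer
    -- ... remaining columns ...
) PARTITION BY RANGE (datelocal);

CREATE TABLE reportimpression_2019_03 PARTITION OF reportimpression_part
    FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');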
If your rows are wide, create an index that allows index-only scans. Like:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);
Related:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
Or INCLUDE additional columns in Postgres 11 or later:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (But being so small it will stay cached much easier and not push other data out as much.)
CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without exclusive locks on the table and even without a btree index (required by CLUSTER). But it's an additional module not shipped with the standard distribution of Postgres.
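For example, with the covering btree index from above (note that CLUSTER takes an ACCESS EXCLUSIVE lock while it runs and needs to be repeated as new data arrives):
CLUSTER reportimpression USING reportimpression_covering_idx;
ANALYZE reportimpression;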
Related:
Optimize Postgres deletion of orphaned records
How to reclaim disk space after delete without rebuilding table?
Your execution plan seems to be doing the right thing.
Things you can do to improve, in descending order of effectiveness:
Use a materialized view that pre-aggregates the data
Don't use a hosted database, use your own iron with good local storage and lots of RAM.
Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden.
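A minimal sketch of that single, unconditional index (the name is illustrative):
CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);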

Selecting primary key:Why postgres prefers to do sequential scan vs index scan

I have the following table
create table log
(
id bigint default nextval('log_id_seq'::regclass) not null
constraint log_pkey
primary key,
level integer,
category varchar(255),
log_time timestamp,
prefix text,
message text
);
It contains about 3 million rows.
I'm comparing the following queries:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000
which yields the following plan:
Limit (cost=0.00..19498.87 rows=100000 width=8)
-> Seq Scan on log (cost=0.00..422740.48 rows=2168025 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
And the same query with ORDER BY id instruction added:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000
which yields
Limit (cost=0.43..25694.15 rows=100000 width=8)
-> Index Scan using log_pkey on log (cost=0.43..557048.28 rows=2168031 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
I have the following questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they might just as well be delivered sorted. Why does it not use the index without ORDER BY?
How can Postgres use the index at all in such a query? The WHERE clause contains a non-indexed column, and to fetch that column a sequential scan of the table would be required, yet the plan for the query with ORDER BY doesn't indicate that.
The Postgres manual page says:
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is always sequential access (at least in terms of page scanning), compared to reading non-ordered data and then ordering it manually.
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is always sequential access (at least in terms of page scanning), compared to reading non-ordered data and then ordering it manually.
The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.
Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.
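You can inspect the current values; the defaults are 1.0 for sequential and 4.0 for random page reads:
SHOW seq_page_cost;
SHOW random_page_cost;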
Let's go back to your earlier questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they might just as well be delivered sorted. Why does it not use the index without ORDER BY?
The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".
The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".
You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.
Hope that helps, and let me know if you have further questions.

How does postgres decide whether to use index scan or seq scan?

explain analyze shows that postgres will use index scanning for my query that fetches rows and performs filtering by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
However, running the same query with a different date filter ('2016-04-14 05:27:51.039') shows that Postgres will run the query using a seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does Postgres decide what to use, specifically when filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
The manual:
Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows to be examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that is, the fraction of rows that match each condition in the WHERE clause. The information used for this task is stored in the pg_statistic system catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
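For example, the planner's (approximate) row count for a table can be read from pg_class:
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'tbl';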
The more rows Postgres expects to find, the more likely it will switch to a sequential scan, which is cheaper for retrieving large portions of a table.
Generally, it goes index scan -> bitmap index scan -> sequential scan as the number of rows expected to be retrieved grows.
For your particular example, the important statistic is histogram_bounds, which gives Postgres a rough idea of how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:
The engine uses the index to find the first value meeting the condition.
The engine then loops over the values, finishing when the condition is no longer true.
For each value in the index, the engine then looks up the data on the data page.
In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
When the engine does a full table scan it:
Starts with the first record on the first page
Does the comparison and accepts or rejects the record
Continues sequentially through all data pages
There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
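If you want to see what the index plan would cost for the second query, you can temporarily discourage sequential scans in a test session (a diagnostic trick only, not a production setting):
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET enable_seqscan;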

Why does PostgreSQL perform sequential scan on indexed column?

Very simple example - one table, one index, one query:
CREATE TABLE book
(
id bigserial NOT NULL,
"year" integer,
-- other columns...
);
CREATE INDEX book_year_idx ON book (year);
EXPLAIN
SELECT *
FROM book b
WHERE b.year > 2009
gives me:
Seq Scan on book b (cost=0.00..25663.80 rows=105425 width=622)
Filter: (year > 2009)
Why does it NOT perform an index scan instead?
What am I missing?
If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan.
This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap), whereas a sequential scan only requires a single IO for each row - or even less, because a block (page) on disk contains more than one row, so more than one row can be fetched with a single IO operation.
Btw: this is true for other DBMSs as well - optimizations such as "index only scans" aside (though for a SELECT *, it's highly unlikely such a DBMS would go for an "index only scan").
Did you ANALYZE the table/database? And what about the statistics? When there are many records where year > 2009, a sequential scan might be faster than an index scan.
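For example (cheap to run; autovacuum normally does this automatically in the background):
ANALYZE book;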
@a_horse_with_no_name explained it quite well. Also, if you really want to use an index scan, you should generally use bounded ranges in the WHERE clause, e.g.
year >= 2019 AND year < 2020.
A lot of the time, statistics are not updated on a table, and it may not be possible to do so due to constraints. In this case, the optimizer does not know how many rows it will get for year > 2019, so it selects a sequential scan in lieu of full knowledge. Bounded ranges solve the problem most of the time.
In an index scan, the read head jumps from one row to another, which is roughly 1000 times slower than reading the next physical block (as in a sequential scan).
So, as a rough rule of thumb, if (the number of records to be retrieved * 1000) is less than the total number of records, the index scan will perform better.