Slow postgres query even though it does bitmap index scan - postgresql

I have a table with 4707838 rows. When I run the following query on this table it takes around 9 seconds to execute.
SELECT json_agg(
         json_build_object(
           'accessorId', p."accessorId",
           'mobile', json_build_object(
             'enabled', p.mobile,
             'settings', json_build_object(
               'proximityAccess', p."proximity",
               'tapToAccess', p."tapToAccess",
               'clickToAccessRange', p."clickToAccessRange",
               'remoteAccess', p."remote")),
           'card', json_build_object('enabled', p."card"),
           'fingerprint', json_build_object('enabled', p."fingerprint"))
       ) AS permissions
FROM permissions AS p
WHERE p."accessPointId" = 99;
The output of explain analyze is as follows:
Aggregate (cost=49860.12..49860.13 rows=1 width=32) (actual time=9011.711..9011.712 rows=1 loops=1)
Buffers: shared read=29720
I/O Timings: read=8192.273
-> Bitmap Heap Scan on permissions p (cost=775.86..49350.25 rows=33991 width=14) (actual time=48.886..8704.470 rows=36556 loops=1)
Recheck Cond: ("accessPointId" = 99)
Heap Blocks: exact=29331
Buffers: shared read=29720
I/O Timings: read=8192.273
-> Bitmap Index Scan on composite_key_accessor_access_point (cost=0.00..767.37 rows=33991 width=0) (actual time=38.767..38.768 rows=37032 loops=1)
Index Cond: ("accessPointId" = 99)
Buffers: shared read=105
I/O Timings: read=32.592
Planning Time: 0.142 ms
Execution Time: 9012.719 ms
This table has a btree index on the accessorId column and a composite index on (accessorId, accessPointId).
Can anyone tell me what could be the reason for this query to be slow even though it uses an index?

Over 90% of the time is spent waiting to get data from disk. At roughly 0.28 ms per read (about 8192 ms of read time over 29720 reads), that is pretty fast for a hard drive (suggesting that much of the data was already in the filesystem cache, or that some of the reads brought in neighboring data that was also eventually required--that is, sequential reads, not just random reads) but slow for an SSD.
If you set enable_bitmapscan=off and clear the cache (or pick a not recently used "accessPointId" value) what performance do you get?
How big is the table? If you are reading a substantial fraction of the table and think you are not getting as much benefit from sequential reads as you should be, you can try making your OS's readahead settings more aggressive. On Linux that is something like sudo blockdev --setra ...
You could put all columns referred to by the query into the index, to enable index-only scans. But given the number of columns you are using, that might be impractical. You would want "accessPointId" to be the first column in the index. By the way, is the index currently used really on (accessorId, accessPointId)? It looks to me like "accessPointId" is really the first column in that index, not the 2nd one.
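If the existing composite index really does lead with accessorId, a sketch of the two options above (the index names and the INCLUDE column list are illustrative, not taken from your schema; INCLUDE needs PostgreSQL 11 or later):
-- Option 1: lead the index with the column the WHERE clause filters on.
CREATE INDEX permissions_accesspoint_accessor_idx
    ON permissions ("accessPointId", "accessorId");
-- Option 2: a covering index to allow an index-only scan; impractical
-- if the list of referenced columns keeps growing.
CREATE INDEX permissions_accesspoint_covering_idx
    ON permissions ("accessPointId")
    INCLUDE ("accessorId", mobile, "proximity", "tapToAccess",
             "clickToAccessRange", "remote", "card", "fingerprint");
For index-only scans to pay off, the table also has to be vacuumed often enough that the visibility map stays current.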
You could cluster the table by an index which has "accessPointId" as the first column. That would group the related records together for faster access. But note it is a slow operation and takes a strong lock on the table while it is running, and future data going into the table won't be clustered, only the current data.
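A sketch of that, assuming an index leading with "accessPointId" exists (here the hypothetical permissions_accesspoint_accessor_idx from above):
-- CLUSTER rewrites the whole table in index order and holds an
-- ACCESS EXCLUSIVE lock for the duration.
CLUSTER permissions USING permissions_accesspoint_accessor_idx;
ANALYZE permissions;  -- refresh statistics (including correlation) afterwards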
You could try to increase effective_io_concurrency so that you can have multiple I/O requests outstanding at a time. How effective this is will depend on your hardware.
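For example (the value is only illustrative; effective_io_concurrency mainly benefits bitmap heap scans like the one in your plan):
-- Per session, to experiment:
SET effective_io_concurrency = 32;
-- Or instance-wide; takes effect after a configuration reload:
ALTER SYSTEM SET effective_io_concurrency = 32;
SELECT pg_reload_conf();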

Related

Postgresql getting last row without use of sequence or timestamp

At work, we are using a program that writes all its values into a single table.
Some of this data, for instance machine settings, may not change for a whole month; the program only writes when a value changes.
So getting the latest data point has always been done with something like:
SELECT value
FROM table
WHERE tag_id = id
ORDER BY timestamp DESC
LIMIT 1;
If I understand correctly, in this case it will scan the whole table, which means a longer execution time and wasted resources.
I will have multiple of these functions running and hammering the server, which is not acceptable.
I cannot use the last value of the id sequence, because multiple kinds of values are written to the same table and share the same id sequence. So we are talking about the last row for a certain tag_id (those differ for different kinds of data), not the last row of the table.
I was looking at FETCH, but from my understanding it takes the same time as LIMIT.
I was thinking along the lines of a FOR loop going back over slices of time until a value is found, and maybe later optimizing it with a temporary table that stores the timestamp of the last value, searching from that timestamp onward and, if nothing newer is found, using the stored last value.
But I was wondering if there is a better/faster, more optimized way in Postgres.
If you want to speed up the query, you should create an index. Switching to a LOOP is almost always a recipe for making things slower.
To support your query, create the following index:
create index on the_table (tag_id, "timestamp");
On a test table with a million rows and 1000 different tag_id values, the execution plan looks like this:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..1.39 rows=1 width=53) (actual time=0.066..0.067 rows=1 loops=1)
Buffers: shared hit=1 read=3
I/O Timings: read=0.051
-> Index Scan Backward using test_table_tag_id_timestamp_idx on test_table (cost=0.42..924.66 rows=953 width=53) (actual time=0.066..0.066 rows=1 loops=1)
Index Cond: (tag_id = 41)
Buffers: shared hit=1 read=3
I/O Timings: read=0.051
Planning:
Buffers: shared hit=57 read=1
I/O Timings: read=0.011
Planning Time: 5.717 ms
Execution Time: 0.089 ms
You can see that Postgres only needed to read 4 blocks (shared hit=1 read=3) to get the row in question. This will be pretty much constant even if the table grows.
Note that the high planning time is caused by the buffer reads needed to fetch the table's metadata, because I ran the query right after creating the table and inserting the rows. If the query is run a second time, this pretty much vanishes, as the metadata will be cached and the planning time will drop to substantially less than one millisecond.

How can I optimize this query in Postgres

The query below is taking a long time to run. How can I optimize it so it scales to more records? I have run EXPLAIN ANALYZE for this query; the output is included below.
This is an existing query, created as a view, that takes a long time (hours) to return the result.
I have run VACUUM, ANALYZE, and REINDEX on these two tables, but no luck.
select st_tr.step_trail_id,
st_tr.test_id,
st_tr.trail_id,
st_tr.step_name,
filter.regular_expression as filter_expression,
filter.order_of_occurrence as filter_order,
filter.match_type as filter_match_type,
null as begins_with,
null as ends_with,
null as input_source,
null as pattern_expression,
null as pattern_matched,
null as pattern_status,
null as pattern_order,
'filter' as record_type
from tab_report_step st_tr,
tab_report_filter filter
where st_tr.st_tr_id = filter.st_tr_id;
Query plan:
Hash Join (cost=446852.58..1176380.76 rows=6353676 width=489) (actual time=16641.953..47270.831 rows=6345360 loops=1)
Buffers: shared hit=1 read=451605 dirtied=5456 written=5424, temp read=154080 written=154074
-> Seq Scan on tab_report_filter filter (cost=0..24482.76 rows=6353676 width=161) (actual time=0.041..8097.233 rows=6345360 loops=1)
Buffers: shared read=179946 dirtied=4531 written=4499
-> Hash (cost=318817.7..318817.7 rows=4716070 width=89) (actual time=16627.291..16627.291 rows=4709040 loops=1)
Buffers: shared hit=1 read=271656 dirtied=925 written=925, temp written=47629
-> Seq Scan on tab_report_step st_tr (cost=0..318817.7 rows=4716070 width=89) (actual time=0.059..10215.484 rows=4709040 loops=1)
Buffers: shared hit=1 read=271656 dirtied=925 written=925
You have not run a plain VACUUM on these tables. Perhaps you ran VACUUM (FULL), but certainly not a plain VACUUM.
There are two things that can be improved:
Make sure that no pages have to be dirtied or written while you read them. The dirtied and written buffers most likely appear because this is the first time you read the rows, and PostgreSQL sets hint bits.
Running VACUUM (without FULL) beforehand would have fixed that. Also, if you repeat the experiment, you shouldn't get those dirtied and written buffers any more.
Give the query more memory by increasing work_mem. The hash does not fit in work_mem and spills to disk, which causes extra disk reads and writes, which is bad for performance.
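A sketch of both steps; the work_mem value is illustrative and has to fit your RAM and the number of concurrent sessions:
-- Set hint bits and update the visibility map so later reads
-- don't have to dirty and write pages:
VACUUM tab_report_step;
VACUUM tab_report_filter;
-- Give this session enough memory for the hash table so it does not
-- spill to temporary files (the temp read/written buffers in the plan):
SET work_mem = '512MB';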
Since you join two big tables with no restricting WHERE conditions and have a lot of result rows, this query will never be fast.

Postgres slow when selecting many rows

I'm running Postgres 11.
I have a table with 1.000.000 (1 million) rows and each row has a size of 40 bytes (it contains 5 columns). That is equal to 40MB.
When I execute (directly on the DB via DBeaver, DataGrip etc., not called via Node, Python etc.):
SELECT * FROM TABLE
it takes 40 seconds the first time (isn't this very slow, even for the first time?).
The CREATE statement of my tables:
CREATE TABLE public.my_table_1 (
c1 int8 NOT NULL GENERATED ALWAYS AS IDENTITY,
c2 int8 NOT NULL,
c3 timestamptz NULL,
c4 float8 NOT NULL,
c5 float8 NOT NULL,
CONSTRAINT my_table_1_pkey PRIMARY KEY (c1)
);
CREATE INDEX my_table_1_c3_idx ON public.my_table_1 USING btree (c3);
CREATE UNIQUE INDEX my_table_1_c2_idx ON public.my_table_1 USING btree (c2);
On 5 random tables: EXPLAIN (ANALYZE, BUFFERS) select * from [table_1...2,3,4,5]
Seq Scan on table_1 (cost=0.00..666.06 rows=34406 width=41) (actual time=0.125..7.698 rows=34406 loops=1)
Buffers: shared read=322
Planning Time: 15.521 ms
Execution Time: 10.139 ms
Seq Scan on table_2 (cost=0.00..9734.87 rows=503187 width=41) (actual time=0.103..57.698 rows=503187 loops=1)
Buffers: shared read=4703
Planning Time: 14.265 ms
Execution Time: 74.240 ms
Seq Scan on table_3 (cost=0.00..3486217.40 rows=180205440 width=41) (actual time=0.022..14988.078 rows=180205379 loops=1)
Buffers: shared hit=7899 read=1676264
Planning Time: 0.413 ms
Execution Time: 20781.303 ms
Seq Scan on table_4 (cost=0.00..140219.73 rows=7248073 width=41) (actual time=13.638..978.125 rows=7247991 loops=1)
Buffers: shared hit=7394 read=60345
Planning Time: 0.246 ms
Execution Time: 1264.766 ms
Seq Scan on table_5 (cost=0.00..348132.60 rows=17995260 width=41) (actual time=13.648..2138.741 rows=17995174 loops=1)
Buffers: shared hit=82 read=168098
Planning Time: 0.339 ms
Execution Time: 2730.355 ms
When I add a LIMIT 1.000.000 to table_5 (it contains about 18 million rows):
Limit (cost=0.00..19345.79 rows=1000000 width=41) (actual time=0.007..131.939 rows=1000000 loops=1)
Buffers: shared hit=9346
-> Seq Scan on table_5(cost=0.00..348132.60 rows=17995260 width=41) (actual time=0.006..68.635 rows=1000000 loops=1)
Buffers: shared hit=9346
Planning Time: 0.048 ms
Execution Time: 164.133 ms
When I add a WHERE clause between 2 dates (I monitored the query below with DataDog software; the results are here (max. ~31K rows/sec when fetching): https://www.screencast.com/t/yV0k4ShrUwSd):
Seq Scan on table_5 (cost=0.00..438108.90 rows=17862027 width=41) (actual time=0.026..2070.047 rows=17866766 loops=1)
Filter: (('2018-01-01 00:00:00+04'::timestamp with time zone < matchdate) AND (matchdate < '2020-01-01 00:00:00+04'::timestamp with time zone))
Rows Removed by Filter: 128408
Buffers: shared hit=168180
Planning Time: 14.820 ms
Execution Time: 2673.171 ms
All tables have a unique index on the c3 column.
The size of the database is like 500GB in total.
The server has 16 cores and 112GB M2 memory.
I have tried to tune Postgres settings, like work_mem (1GB), shared_buffers (50GB), and effective_cache_size (20GB), but it doesn't seem to change anything (I know the settings have been applied, because I can see a big difference in the amount of memory the server has allocated).
I know the database is too big for all data to be in memory. But is there anything I can do to boost the performance / speed of my query?
Make sure CreatedDate is indexed (see the sketch after these points).
Make sure CreatedDate uses the date column type. That is more efficient for storage (just 4 bytes) and for performance, and you can use all the built-in date formatting and functions.
Avoid select * and only select the columns you need.
Use YYYY-MM-DD ISO 8601 format. This has nothing to do with performance, but it will avoid a lot of ambiguity.
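A minimal sketch of those points against the anonymized schema from the question (c3 is the timestamp column the answer calls CreatedDate; the column list and dates are illustrative):
-- Only if the time of day is never needed: a plain date is 4 bytes.
-- ALTER TABLE public.my_table_1 ALTER COLUMN c3 TYPE date;
-- Select only the columns you need, with an ISO 8601 date range
-- (c3 is already indexed by my_table_1_c3_idx):
SELECT c2, c3, c4
FROM public.my_table_1
WHERE c3 >= '2018-01-01'
  AND c3 < '2020-01-01';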
The real problem is likely that you have thousands of tables and regularly make unions of hundreds of them. This indicates a need to redesign your schema to simplify your queries and get better performance.
Unions and date change checks suggest a lot of redundancy. Perhaps you've partitioned your tables by date. Postgres has its own built-in table partitioning which might help.
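A minimal sketch of declarative range partitioning (supported natively since PostgreSQL 10; the table names, column types, and ranges are illustrative, reusing the anonymized columns from the question):
CREATE TABLE my_table_part (
    c2 int8 NOT NULL,
    c3 timestamptz NOT NULL,
    c4 float8 NOT NULL,
    c5 float8 NOT NULL
) PARTITION BY RANGE (c3);
CREATE TABLE my_table_part_2018 PARTITION OF my_table_part
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE my_table_part_2019 PARTITION OF my_table_part
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
-- On PostgreSQL 11+ this also creates a matching index on every partition:
CREATE INDEX ON my_table_part (c3);
A date-range WHERE clause on c3 then only scans the matching partitions instead of the whole table.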
Without more detail that's all I can say. Perhaps ask another question about your schema.
Without seeing EXPLAIN (ANALYZE, BUFFERS), all we can do is speculate.
But we can do some pretty good speculation.
Cluster the tables on the index on CreatedDate. This will allow the data to be accessed more sequentially, allowing more read-ahead (but this might not help much for some kinds of storage). If the tables have a high write load, they may not stay clustered, so you would have to recluster them occasionally. If they are static, this could be a one-time event.
Get more RAM. If you want to perform as if all the data was in memory, then get all the data into memory.
Get faster storage, like top-notch SSD. It isn't as fast as RAM, but much faster than HDD.

Why is Postgres not using index on a simple GROUP BY?

I have created a 36M-row table with an index on the type column:
CREATE TABLE items AS
SELECT
(random()*36000000)::integer AS id,
(random()*10000)::integer AS type,
md5(random()::text) AS s
FROM
generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");
I run this simple query and expect postgresql to use my index:
explain select count(*) from "items" group by "type";
But the query planner decides to use Seq Scan instead:
HashAggregate (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
Group Key: type
-> Seq Scan on items (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms
Time without EXPLAIN: 5s 979ms
I have tried several solutions from here and here:
Run VACUUM ANALYZE
Configure default_statistics_target, random_page_cost, work_mem
but nothing helps apart from setting enable_seqscan = OFF:
SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";
GroupAggregate (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
Group Key: type
-> Index Only Scan using items_type_idx on items (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms
Time without EXPLAIN: 659ms
Query with index scan is about 10x faster on my machine.
Is there a better solution than setting enable_seqscan?
UPD1
My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).
UPD2
I have tried to fill the type column not with random numbers but with i / 10000, to make pg_stats.correlation = 1 -- still a seq scan.
UPD3
@jgh is 100% right:
This typically only happens when the table's row width is much wider than some indexes
I've made the column data larger and now Postgres uses the index. Thanks everyone!
The Index-only scans wiki says
It is important to realise that the planner is concerned with
minimising the total cost of the query. With databases, the cost of
I/O typically dominates. For that reason, "count(*) without any
predicate" queries will only use an index-only scan if the index is
significantly smaller than its table. This typically only happens when
the table's row width is much wider than some indexes'.
and
Index-only scans are only used when the planner surmises that that
will reduce the total amount of I/O required, according to its
imperfect cost-based modelling. This all heavily depends on visibility
of tuples, if an index would be used anyway (i.e. how selective a
predicate is, etc), and if there is actually an index available that
could be used by an index-only scan in principle
Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner to use a seq scan.
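A quick way to see the size ratio the planner is weighing, using the table and index names from the question (pg_relation_size reports the on-disk size of a relation):
SELECT pg_size_pretty(pg_relation_size('items'))          AS table_size,
       pg_size_pretty(pg_relation_size('items_type_idx')) AS index_size;
If the index is not much smaller than the table, the planner expects little I/O saving from an index-only scan.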

Slow query with good plan

We have two servers; the newer one is going to replace the older one. They are almost the same regarding performance, except for a single query. The same query on the two servers (same database definition, same data, indexes just rebuilt from scratch) takes MUCH more time on the newer instance.
The two plans are identical and the query is pretty simple:
Nested Loop (cost=0.00..17.83 rows=1 width=2262) (actual time=0.032..0.032 rows=0 loops=1)
Buffers: shared hit=3
-> Index Scan using psan_para_fk_ix on parasetana a0 (cost=0.00..9.48 rows=1 width=1735) (actual time=0.030..0.030 rows=0 loops=1)
Index Cond: (((ca)::text = 'r'::text) AND (idp = 36678502::numeric))
Filter: (flg = '1'::bpchar)
Buffers: shared hit=3
-> Index Scan using seta_pk on seta a1 (cost=0.00..8.33 rows=1 width=527) (never executed)
Index Cond: (((a1.ca)::text = 'r'::text) AND (a1.idgrla = a0.idgrla ) AND (a1.prog = a0.prog_set))
Filter: (a1.flgp = '0'::bpchar)
Total runtime: 0.153 ms
(10 rows)
Time: 2217.074 ms
As you can see, the total runtime is 0.2 ms. This is the case on both the new and the old server. However, the Time on the old server is 30 ms, while on the new server it is roughly 70 times more (2.2 seconds vs 30 milliseconds).
What can cause such a difference? The PostgreSQL docs say that for SELECT statements the total runtime and the time should be nearly the same...
thanks
As I understand it, this is a simple join using nested loops with appropriate indexing. The problem is probably due to bad caching of the second (large) table. Possibly the second table is badly clustered with respect to the index used. Try the CLUSTER command to see if it helps.
You may also try to change the plan. The options you may need: swap the join order, or use a hash join.
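For experimentation, the join method can be discouraged per session with the standard planner settings (enable_nestloop, enable_hashjoin); a sketch, with the original query left as a placeholder since it is not shown above:
BEGIN;
SET LOCAL enable_nestloop = off;        -- push the planner towards a hash join
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- the query behind the plan above
ROLLBACK;                               -- SET LOCAL only lasts until end of transaction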