Optimize Postgres query - Bitmap heap scan seems slow - postgresql

I'm testing a query and wondering if performance can be improved.
There are 9 million records in this table. The database used is Postgres 12.
The query should be quick for any account, docket type, quantity.
SELECT d.docket_id, d.docket_type_id
FROM docket d
WHERE d.account_id = 557 AND (d.docket_type_id = ANY(array[2]))
AND exists (
select 1
from jsonb_array_elements(d.equipment) as eq
where (eq->'quantity')::integer >= 150)
order by docket_id desc
--limit 500

There is no good way to directly index jsonb for the existence of a value >= some constant.
You could write a function to summarize the jsonb to the maximum value, index that, then test if this maximum is >= your constant:
create function jsonb_max(jsonb, text) returns integer language sql as $$
select max((e->>$2)::int) from jsonb_array_elements($1) f(e);
$$ immutable parallel safe;
create index on docket (account_id , docket_type_id, jsonb_max(equipment,'quantity'));
SELECT d.docket_id, d.docket_type_id
FROM docket d
WHERE d.account_id = 557
AND (d.docket_type_id = ANY(array[2]))
AND jsonb_max(equipment,'quantity')>=150
order by docket_id desc

Related

Range contains element filter not using index of the element column

I have a table with a timestamp column that has a btree index.
CREATE TABLE tstest AS
SELECT '2019-11-10'::timestamp + random() * '10 day'::interval ts
FROM generate_series(1,10000) s;
CREATE INDEX ON tstest(ts);
I would like to find all row between a timerange. Both ends of the range can be "infinite"/null, start of the range must be exclusive and end is inclusice. So this forms a query:
SELECT * FROM tstest WHERE ts <# tsrange($1, $2, '(]');
The result is correct but the index of the ts column is not used and a seq scan is done instead.
To use the index correctly I have to do the query like this:
SELECT * FROM tstest WHERE ($1 IS null OR $1 < ts) AND ($2 IS null OR ts <= $2);
I like the <# syntax more. It is easier to understand and it is shorter.
Is there something I can do differently to utilize the index and make the query faster? Maybe a different type of index instead?
I have also tried to add a gist index for ts column using the btree_gist module but that didn't change the situation.
I have tested this with PostgreSQL 9.6 and 12.0.
A btree index (which you are using) doesn't support the <# operator and a GiST index wouldn't help because the timestamp column isn't a range type.
But you can still make the btree index usable by using coalesce and infinity which removes the use of the OR condition:
SELECT *
FROM tstest
WHERE ts > coalesce($1, '-infinity')
and ts <= coalesce($2, 'infinity');

Optimizing a query that compares a table to itself with millions of rows

I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:
CREATE TABLE IF NOT EXISTS data.row_check (
id uuid NOT NULL DEFAULT NULL,
version int8 NOT NULL DEFAULT NULL,
row_hash int8 NOT NULL DEFAULT NULL,
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_check_pkey
PRIMARY KEY (id, version)
);
I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:
-- Run my test of "version 0" of the push code, the base code I'm refactoring.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
truncate table record_changes_log;
-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:
id version row_hash table_name
e6218751-ab78-4942-9734-f017839703f6 0 -142492569 record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1 0 -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d 0 -323725113 record_changes_log
1935590c-8d22-4baf-85ba-00b563022983 0 -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2 0 -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0 0 1857654119 record_changes_log
I asked yesterday for some help on the comparison query, and it lead to this:
select v0.table_name,
v0.id,
v0.row_hash as v0,
v1.row_hash as v1
from row_check v0
join row_check v1 on v0.id = v1.id and
v0.version = 0 and
v1.version = 1 and
v0.row_hash <> v1.row_hash
That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:
drop index if exists row_check_version_btree;
create index row_check_version_btree
on row_check
using btree(version);
cluster row_check using row_check_version_btree;
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.
drop index if exists row_check_version_brin;
create index row_check_version_brin
on row_check
using brin(row_hash);
vacuum analyze row_check;
I ran the query through explain analyze and get this:
Merge Join (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
Inner Unique: true
Merge Cond: (v0.id = v1.id)
Join Filter: (v0.row_hash <> v1.row_hash)
Rows Removed by Join Filter: 4318290
Buffers: shared hit=8679005 read=42511
-> Index Scan using row_check_pkey on ascendco.row_check v0 (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
Output: v0.id, v0.version, v0.row_hash, v0.table_name
Index Cond: (v0.version = 0)
Buffers: shared hit=4360752
-> Index Scan using row_check_pkey on ascendco.row_check v1 (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
Output: v1.id, v1.version, v1.row_hash, v1.table_name
Index Cond: (v1.version = 1)
Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
...which I did not really get the right idea from...so I ran it again to JSON and fed the results into this wonderful plan visualizer:
http://tatiyants.com/pev/#/plans
The tips there are right, the top node estimate is bad. The result is 10 rows, the estimate is for about 443,757 rows.
I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:
-- CREATE STATISTICS?
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
Granted, it's funny to even try to index something where there are only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing two versions, so "old" and "new", which I can express however makes life best. I understand that Postgres only has bitmap indexes internally at search/merge (?) time and that it does not have a bitmap type index. Would there be some kind of INTERSECT that might help? I don't know how Postgres implements set math operators internally.
Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.
I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:
DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
id uuid NOT NULL DEFAULT NULL,
hash_1 int8, -- Want NULL to defer calculating hash comparison until after both hashes are entered.
hash_2 int8, -- Ditto
hashes_match boolean, -- Likewise
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_compare_pkey
PRIMARY KEY (id)
);
The following expression index should, hopefully, be very small as I shouldn't have any non-matching entries:
CREATE INDEX row_compare_fail ON row_compare (hashes_match)
WHERE hashes_match = false;
The trigger below does the column calculation, once hash_1 and hash_2 are both provided:
-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.hash_1 = NULL OR
NEW.hash_2 = NULL THEN
RETURN NEW; -- Don't do the comparison, hash_1 hasn't been populated yet.
ELSE-- Do the comparison. The point of this is to avoid constantly thrashing the expression index.
NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
RETURN NEW; -- important!
END IF;
END;
$BODY$
LANGUAGE plpgsql;
This now adds 4.3M rows instead of 8.6M rows:
-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_1 = EXCLUDED.hash_1,
table_name = EXCLUDED.table_name;
-- I'll truncate the record_changes_log and push my sample data again here.
-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_2 = EXCLUDED.hash_2,
table_name = EXCLUDED.table_name;
And now the results are simple to find:
select * from row_compare where hashes_match = false;
This changes the query time from around 17 seconds to around 24 milliseconds.

Select query became very very very slow in postgresql

I have one table which contains "133,072,194" records and I am trying to execute
SELECT COUNT(test)
FROM mytable
WHERE test = false
but it is taking Execution time: 128320.712 ms
I already have indexing on test column. Could you please let me know, what I can optimize or change, so my query became faster?
Because of this, my other select query is also not working.
If there are many rows where test is FALSE, you won't be able to get an exact result faster than with a sequential scan, which is slow for big tables.
If you have only few rows that satisfy the condition, you should create a partial index:
CREATE INDEX mytable_notest_ind ON mytable(id) WHERE NOT test;
(assuming that id is the primary key) and keep mytable autovacuumed often enough that you get an index only scan.
But usually exact results for queries like this are not required.
You could calculate an estimated count from the table statistics with a query like this:
SELECT t.reltuples
* (1 - t.nullfrac)
* mcv.freq AS count_false
FROM pg_stats AS s
CROSS JOIN LATERAL unnest(s.most_common_vals::text::boolean[],
s.most_common_freqs) AS mcv(val, freq)
JOIN pg_class AS t
ON s.tablename = t.relname
AND s.schemaname = t.relnamespace::regnamespace::text
WHERE s.tablename = 'mytable'
AND s.attname = 'test'
AND mcv.val = FALSE;
That would be very fast.
See my blog post for more considerations about the speed of SELECT count(*).

Postgres: Optimize a huge GROUP BY

I have such a table:
CREATE TABLE values (first_id varchar(26), sec_id int, mode varchar(6), external1_id varchar(23), external2_id varchar(26), x int, y int);
There may be multiple values having the same first_id, my goal is to flatten (into json) for each first_id, all the related rows, into another table.
I do this this way:
INSERT INTO othervalues(first_id, results)
SELECT first_id, json_agg(values) AS results FROM values GROUP BY first_id;
In the results column, I have a json array of all the rows, that I can use later as it is.
My problem is that this is very slow, with a huge table: with about 100 000 000 rows in values, this slows down my computer (I actually test locally) until it dies (this is an Ubuntu).
Using EXPLAIN I noticed that is used a GroupPartitioner, I added:
SET work_mem = '1GB';
Now it uses a HashPartitioner, but this still kills my computer. An explain gives me:
Insert on othervalues (cost=2537311.89..2537316.89 rows=200 width=64)
-> Subquery Scan on "*SELECT*" (cost=2537311.89..2537316.89 rows=200 width=64)
-> HashAggregate (cost=2537311.89..2537314.39 rows=200 width=206)
-> Seq Scan on values (cost=0.00..2251654.26 rows=57131526 width=206)
Any idea how to optimize it?
The solution I finally use is to split the GROUP BY into multiple:
First I create a temporary table with the unique IDs of the stuff I want to group. This allow to get only a part of the results - like with OFFSET and LIMIT - but these there can be very slow with huge results sets (big offset mean the execution tree will yet browse the first results):
CREATE TEMP TABLE tempids AS SELECT ROW_NUMBER() OVER (ORDER BY theid), theid FROM (SELECT DISTINCT theid FROM sourcetable) sourcetable;
Then in a WHILE loop:
DO $$DECLARE
r record;
i INTEGER := 0;
step INTEGER := 500000;
size INTEGER := (SELECT COUNT(*) FROM tempids);
BEGIN
WHILE i <= size
LOOP
INSERT INTO target(theid, theresult)
SELECT ...
WHERE tempids > i AND tempids < i + step
GROUP BY tempids.theid;
This looks like usual coding, this is not nice sql, but this works.

Postgresql function executed much longer than the same query

I'm using PostgreSQL 9.2.9 and have the following problem.
There are function:
CREATE OR REPLACE FUNCTION report_children_without_place(text, date, date, integer)
RETURNS TABLE (department_name character varying, kindergarten_name character varying, a1 bigint) AS $BODY$
BEGIN
RETURN QUERY WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= $3
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST($1 AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = $4
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
END
$BODY$ LANGUAGE PLPGSQL STABLE;
EXPLAIN ANALYZE SELECT * FROM report_children_without_place('83.86443.86445', '14-04-2015', '14-04-2015', 2014);
Total runtime: 242805.085 ms.
But query from function's body executes much faster:
EXPLAIN ANALYZE WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= '14-04-2015'
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST('83.86443.86445' AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = 2014
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
Total runtime: 2156.740 ms.
Why function executed so longer than the same query? Thank's
Your query runs faster because the "variables" are not actually variable -- they are static values (IE strings in quotes). This means the execution planner can leverage indexes. Within your stored procedure, your variables are actual variables, and the planner cannot make assumptions about indexes. For example - you might have a partial index on requeststatushistory where "date" is <= '2012-12-31'. The index can only be used if the $3 is known. Since it might hold a date from 2015, the partial index would be of no use. In fact, it would be detrimental.
I frequently construct a string within my functions where I concatenate my variables as literals and then execute the function using something like the following:
DECLARE
my_dynamic_sql TEXT;
BEGIN
my_dynamic_sql := $$
SELECT *
FROM my_table
WHERE $$ || quote_literal($3) || $$::TIMESTAMPTZ BETWEEN start_time
AND end_time;$$;
/* You can only see this if client_min_messages = DEBUG */
RAISE DEBUG '%', my_dynamic_sql;
RETURN QUERY EXECUTE my_dynamic_sql;
END;
The dynamic SQL is VERY useful because you can actually get an explain of the query when I have set client_min_messages=DEBUG; I can scrape the query from the screen and paste it back in after EXPLAIN or EXPLAIN ANALYZE and see what the execution planner is doing. This also allows you to construct very different queries as needed to optimize for variables (IE exclude unnecessary tables if warranted) and maintain a common API for your clients.
You may be tempted to avoid the dynamic SQL for fear of performance issues (I was at first) but you will be amazed at how LITTLE time is spent in planning compared to some of the cost of a couple of table scans on your seven-table join!
Good luck!
Follow-up: You might experiment with Common Table Expressions (CTEs) for performance as well. If you have a table that has a low signal-to-noise ratio (has many, many more records in it than you actually want to return) then a CTE can be very helpful. PostgreSQL executes CTEs early in the query, and materializes the resulting rows in memory. This allows you to use the same result set multiple times and in multiple places in your query. The benefit can really be surprising if you design it correctly.
sql_txt := $$
WITH my_cte as (
select fk1 as moar_data 1
, field1
, field2 /*do not need all other fields taking up RAM!*/
from my_table
where field3 between $$ || quote_literal(input_start_ts) || $$::timestamptz
and $$ || quote_literal(input_end_ts) || $$::timestamptz
),
keys_cte as ( select key_field
from big_look_up_table
where look_up_name = ANY($$ ||
QUOTE_LITERAL(input_array_of_names) || $$::VARCHAR[])
)
SELECT field1, field2, moar_data1, moar_data2
FROM moar_data_table
INNER JOIN my_cte
USING (moar_data1)
WHERE moar_data_table.moar_data_key in (select key_field from keys_cte) $$;
An execution plan is likely to show that it chooses to use an index on moar_data_tale.moar_data_key. This would appear to go against what I said above in my prior answer - except for the fact that the keys_cte results are materialized (and therefore cannot be changed by another transaction in a race-condition) -- you have your own little copy of the data for use in this query.
Oh - and CTEs can use other CTEs that are declared earlier in the same query. I have used this "trick" to replace sub-queries in very complex joins and seen great improvements.
Happy Hacking!