Postgres: Optimize a huge GROUP BY

I have such a table:
CREATE TABLE values (first_id varchar(26), sec_id int, mode varchar(6), external1_id varchar(23), external2_id varchar(26), x int, y int);
There may be multiple rows with the same first_id; my goal is to flatten (into JSON), for each first_id, all the related rows into another table.
I do it this way:
INSERT INTO othervalues(first_id, results)
SELECT first_id, json_agg(values) AS results FROM values GROUP BY first_id;
In the results column, I have a json array of all the rows, that I can use later as it is.
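To illustrate the shape of the results column, here is a tiny self-contained example with made-up values (everything in the VALUES list is hypothetical):
SELECT first_id, json_agg(v) AS results
FROM (VALUES ('a01', 1, 'mode_a', 'ext1', 'ext2', 10, 20),
             ('a01', 2, 'mode_b', 'ext3', 'ext4', 30, 40)) AS v(first_id, sec_id, mode, external1_id, external2_id, x, y)
GROUP BY first_id;
-- results: [{"first_id":"a01","sec_id":1,"mode":"mode_a",...}, {"first_id":"a01","sec_id":2,"mode":"mode_b",...}]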
My problem is that this is very slow: with about 100,000,000 rows in values, it slows down my computer (I'm actually testing locally) until it dies (this is an Ubuntu machine).
Using EXPLAIN I noticed that a GroupAggregate was used, so I added:
SET work_mem = '1GB';
Now it uses a HashAggregate, but this still kills my computer. EXPLAIN gives me:
Insert on othervalues (cost=2537311.89..2537316.89 rows=200 width=64)
-> Subquery Scan on "*SELECT*" (cost=2537311.89..2537316.89 rows=200 width=64)
-> HashAggregate (cost=2537311.89..2537314.39 rows=200 width=206)
-> Seq Scan on values (cost=0.00..2251654.26 rows=57131526 width=206)
Any idea how to optimize it?

The solution I finally used is to split the GROUP BY into multiple batches:
First I create a temporary table with the unique IDs of the things I want to group. This allows fetching only part of the results, similar to OFFSET and LIMIT, but those can be very slow with huge result sets (a big OFFSET means the executor still has to walk through all the earlier rows):
CREATE TEMP TABLE tempids AS SELECT ROW_NUMBER() OVER (ORDER BY theid) AS row_number, theid FROM (SELECT DISTINCT theid FROM sourcetable) sourcetable;
Then in a WHILE loop:
DO $$DECLARE
i INTEGER := 0;
step INTEGER := 500000;
size INTEGER := (SELECT COUNT(*) FROM tempids);
BEGIN
WHILE i <= size
LOOP
INSERT INTO target(theid, theresult)
SELECT ... -- the original aggregation, with sourcetable joined to tempids on theid
WHERE tempids.row_number > i AND tempids.row_number <= i + step
GROUP BY tempids.theid;
i := i + step;
END LOOP;
END$$;
This looks like ordinary procedural coding rather than nice SQL, but it works.
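For reference, a minimal sketch of what one iteration's statement might look like, assuming the elided SELECT mirrors the original json_agg aggregation and joins sourcetable to tempids on theid (i and step are the loop variables; the join and column list are assumptions):
INSERT INTO target(theid, theresult)
SELECT s.theid, json_agg(s) AS theresult
FROM sourcetable s
JOIN tempids t ON t.theid = s.theid
WHERE t.row_number > i AND t.row_number <= i + step
GROUP BY s.theid;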

Related

Time Consuming SQL Update Statement

In PostgreSQL (version 9.2), I need to update a table with values from another table. The UPDATE statement below works and completes quickly on a small data set (1K records). With a large number of records (600K+), the statement has not completed after more than two hours. I don't know if it is taking a long time or is simply hung.
UPDATE training_records r SET cid =
(SELECT cid_main FROM account_events e
WHERE e.user_ekey = r.ekey
AND e.type = 't'
AND r.enroll_date < e.time
ORDER BY e.time ASC LIMIT 1)
WHERE r.cid IS NULL;
Is there a problem with this statement? Is there a more efficient way to do this?
About the operation: training_records holds course enrollment records for member accounts (identified by ekey) that are organized in groups. cid is the group id. account_events holds account-changing events, including transfers between groups (e.type='t'), where cid_main would be the group id before the transfer. I am trying to retroactively patch the newly added cid column in training_records so it accurately reflects the group membership at the time the course was enrolled. There could be multiple transfers, so I am picking the group id (cid_main) from the earliest transfer after the time of enrollment. Hope this makes sense.
The table training_records has close to 700K records, and account_events has 560K+ records.
Output of EXPLAIN {command above}
Update on training_records r (cost=0.00..13275775666.76 rows=664913 width=74)
-> Seq Scan on training_records r (cost=0.00..13275775666.76 rows=664913 width=74)
Filter: (cid IS NULL)
SubPlan 1
-> Limit (cost=19966.15..19966.16 rows=1 width=12)
-> Sort (cost=19966.15..19966.16 rows=1 width=12)
Sort Key: e."time"
-> Seq Scan on account_events e (cost=0.00..19966.15 rows=1 width=12)
Filter: ((r.enroll_date < "time") AND (user_ekey = r.ekey) AND (type = 't'::bpchar))
(9 rows)
One more update:
Adding an additional condition in the WHERE clause, I limited the number of records from training_records to about 10K. The update took about 15 minutes. If the time is even close to linear in the number of records of this table, 700K records would take about 17+ hours.
Thank you for your help!
Update: It took close to 9 hours, but the original command completed.
Try to transform it to something that does not force a nested loop join:
UPDATE training_records r
SET cid = e.cid_main
FROM account_events e
WHERE e.user_ekey = r.ekey
AND e.type = 't'
AND r.enroll_date < e.time
AND NOT EXISTS (SELECT 1 FROM account_events e1
WHERE e1.user_ekey = r.ekey
AND e1.type = 't'
AND r.enroll_date < e1.time
AND e1.time < e.time)
AND r.cid IS NULL;
The statement actually isn't equivalent: if there is no matching account_events row, your statement will update cid to NULL, while my statement will not update that row.
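Not part of the original answer, but a partial index along these lines might help either formulation find the relevant transfer events quickly (the index name is made up):
CREATE INDEX account_events_transfer_idx
ON account_events (user_ekey, time)
WHERE type = 't';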

Optimizing a query that compares a table to itself with millions of rows

I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:
CREATE TABLE IF NOT EXISTS data.row_check (
id uuid NOT NULL DEFAULT NULL,
version int8 NOT NULL DEFAULT NULL,
row_hash int8 NOT NULL DEFAULT NULL,
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_check_pkey
PRIMARY KEY (id, version)
);
I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:
-- Run my test of "version 0" of the push code, the base code I'm refactoring.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
truncate table record_changes_log;
-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:
id version row_hash table_name
e6218751-ab78-4942-9734-f017839703f6 0 -142492569 record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1 0 -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d 0 -323725113 record_changes_log
1935590c-8d22-4baf-85ba-00b563022983 0 -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2 0 -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0 0 1857654119 record_changes_log
I asked yesterday for some help on the comparison query, and it led to this:
select v0.table_name,
v0.id,
v0.row_hash as v0,
v1.row_hash as v1
from row_check v0
join row_check v1 on v0.id = v1.id and
v0.version = 0 and
v1.version = 1 and
v0.row_hash <> v1.row_hash
That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:
drop index if exists row_check_version_btree;
create index row_check_version_btree
on row_check
using btree(version);
cluster row_check using row_check_version_btree;
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.
drop index if exists row_check_version_brin;
create index row_check_version_brin
on row_check
using brin(row_hash);
vacuum analyze row_check;
I ran the query through explain analyze and get this:
Merge Join (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
Inner Unique: true
Merge Cond: (v0.id = v1.id)
Join Filter: (v0.row_hash <> v1.row_hash)
Rows Removed by Join Filter: 4318290
Buffers: shared hit=8679005 read=42511
-> Index Scan using row_check_pkey on ascendco.row_check v0 (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
Output: v0.id, v0.version, v0.row_hash, v0.table_name
Index Cond: (v0.version = 0)
Buffers: shared hit=4360752
-> Index Scan using row_check_pkey on ascendco.row_check v1 (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
Output: v1.id, v1.version, v1.row_hash, v1.table_name
Index Cond: (v1.version = 1)
Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
...which I did not really get the right idea from, so I ran it again with JSON output and fed the results into this wonderful plan visualizer:
http://tatiyants.com/pev/#/plans
The tips there are right: the top node estimate is bad. The result is 10 rows, but the estimate is for about 4,437,567 rows.
I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:
-- CREATE STATISTICS?
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
Granted, it's funny to even try to index something with only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing two versions, so "old" and "new", which I can express however makes life best. I understand that Postgres only builds bitmaps internally at scan/merge (?) time and that it does not have a bitmap index type. Would there be some kind of INTERSECT that might help? I don't know how Postgres implements set operations internally.
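To make the CREATE STATISTICS notion from the list above concrete, a hedged sketch (PostgreSQL 10+ syntax; whether extended statistics actually improve this particular join estimate is untested):
CREATE STATISTICS row_check_version_id (ndistinct, dependencies)
ON version, id FROM row_check;
ANALYZE row_check;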
Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.
I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:
DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
id uuid NOT NULL DEFAULT NULL,
hash_1 int8, -- Want NULL to defer calculating hash comparison until after both hashes are entered.
hash_2 int8, -- Ditto
hashes_match boolean, -- Likewise
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_compare_pkey
PRIMARY KEY (id)
);
The following partial index should, hopefully, be very small, as I shouldn't have any non-matching entries:
CREATE INDEX row_compare_fail ON row_compare (hashes_match)
WHERE hashes_match = false;
The trigger function below does the column calculation once hash_1 and hash_2 are both provided:
-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.hash_1 IS NULL OR
NEW.hash_2 IS NULL THEN
RETURN NEW; -- Don't do the comparison, both hashes haven't been populated yet.
ELSE -- Do the comparison. The point of this is to avoid constantly thrashing the partial index.
NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
RETURN NEW; -- important!
END IF;
END;
$BODY$
LANGUAGE plpgsql;
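For completeness, the function above still needs to be attached as a trigger; a sketch with an assumed trigger name (use EXECUTE PROCEDURE instead of EXECUTE FUNCTION on PostgreSQL 10 and older):
CREATE TRIGGER row_compare_upsert
BEFORE INSERT OR UPDATE ON data.row_compare
FOR EACH ROW
EXECUTE FUNCTION data.on_upsert_row_compare();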
This now adds 4.3M rows instead of 8.6M rows:
-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_1 = EXCLUDED.hash_1,
table_name = EXCLUDED.table_name;
-- I'll truncate the record_changes_log and push my sample data again here.
-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_2 = EXCLUDED.hash_2,
table_name = EXCLUDED.table_name;
And now the results are simple to find:
select * from row_compare where hashes_match = false;
This changes the query time from around 17 seconds to around 24 milliseconds.

Select query became very very very slow in postgresql

I have one table which contains 133,072,194 records and I am trying to execute
SELECT COUNT(test)
FROM mytable
WHERE test = false
but it is taking Execution time: 128320.712 ms
I already have an index on the test column. Could you please let me know what I can optimize or change so that my query becomes faster?
Because of this, my other select queries are not working well either.
If there are many rows where test is FALSE, you won't be able to get an exact result faster than with a sequential scan, which is slow for big tables.
If you have only few rows that satisfy the condition, you should create a partial index:
CREATE INDEX mytable_notest_ind ON mytable(id) WHERE NOT test;
(assuming that id is the primary key) and keep mytable autovacuumed often enough that you get an index only scan.
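With that partial index in place, an exact count phrased to match the index predicate should be answerable with an index-only scan; a minimal sketch, assuming the table is vacuumed often enough:
SELECT count(*) FROM mytable WHERE NOT test;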
But usually exact results for queries like this are not required.
You could calculate an estimated count from the table statistics with a query like this:
SELECT t.reltuples
* (1 - t.nullfrac)
* mcv.freq AS count_false
FROM pg_stats AS s
CROSS JOIN LATERAL unnest(s.most_common_vals::text::boolean[],
s.most_common_freqs) AS mcv(val, freq)
JOIN pg_class AS t
ON s.tablename = t.relname
AND s.schemaname = t.relnamespace::regnamespace::text
WHERE s.tablename = 'mytable'
AND s.attname = 'test'
AND mcv.val = FALSE;
That would be very fast.
See my blog post for more considerations about the speed of SELECT count(*).

Postgres: Sorting by an immutable function index doesn't use index

I have a simple table.
CREATE TABLE posts
(
id uuid NOT NULL,
vote_up_count integer,
vote_down_count integer,
CONSTRAINT post_pkey PRIMARY KEY(id)
);
I have an IMMUTABLE function that does simple (but could be complex) arithmetic.
CREATE OR REPLACE FUNCTION score(
ups integer,
downs integer)
RETURNS integer AS
$BODY$
select $1 - $2
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
ALTER FUNCTION score(integer, integer)
OWNER TO postgres;
I create an index on the posts table that uses my function.
CREATE INDEX posts_score_index ON posts(score(vote_up_count, vote_down_count), date_created);
When I EXPLAIN the following query, it doesn't seem to be using the index.
SELECT * FROM posts ORDER BY score(vote_up_count, vote_down_count), date_created
Sort (cost=1.02..1.03 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_of_com (...)"
Sort Key: ((posts.vote_up_count - posts.vote_down_count)), posts.date_created
-> Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_ (...)
How do I get my ORDER BY to use an index from an IMMUTABLE function that could have some very complex arithmetic?
EDIT: Based on @Егор-Рогов's suggestions, I changed the query a bit to see if I can get it to use an index. Still no luck.
set enable_seqscan=off;
EXPLAIN VERBOSE select date_created from posts ORDER BY (hot(vote_up_count, vote_down_count, date_created),date_created);
Here is the output.
Sort (cost=10000000001.06..10000000001.06 rows=1 width=16)
Output: date_created, (ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision) / 4 (...)
Sort Key: (ROW(round((((log((GREATEST(abs((posts.vote_up_count - posts.vote_down_count)), 1))::double precision) * sign(((posts.vote_up_count - posts.vote_down_count))::double precision)) + ((date_part('epoch'::text, posts.date_created) - 1134028003::dou (...)
-> Seq Scan on public.posts (cost=10000000000.00..10000000001.05 rows=1 width=16)
Output: date_created, ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision (...)
EDIT2: It seems that I was not using the index because of a second ORDER BY column, date_created.
I can see a couple of points that discourage the planner from using the index.
1.
Look at this line in the explain output:
Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
It says that the planner believes there is only one row in the table. In this case it makes no sense to use an index scan, because a sequential scan is faster.
Try adding more rows to the table, run ANALYZE, and try again. You can also test by temporarily disabling sequential scans with set enable_seqscan=off;.
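A quick sketch of that test, assuming the simplified posts definition from the question (the row count is arbitrary; gen_random_uuid() needs the pgcrypto extension or PostgreSQL 13+):
INSERT INTO posts (id, vote_up_count, vote_down_count)
SELECT gen_random_uuid(), (random() * 100)::int, (random() * 100)::int
FROM generate_series(1, 100000);
ANALYZE posts;
SET enable_seqscan = off; -- for testing only
-- then re-run the EXPLAIN from the question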
2.
You use your function to sort the results. So the planner may decide to use the index in order to get tuple ids in the correct order. But then it needs to fetch each tuple from the table to get the values of all columns (because of select *).
You can make the index more attractive to the planner by adding all necessary columns to it, which makes it possible to avoid the table fetch. This is called an index-only scan.
CREATE INDEX posts_score_index ON posts(
score(vote_up_count, vote_down_count),
date_created,
id, -- do you actually need it in result set?
vote_up_count, -- do you actually need it in result set?
vote_down_count -- do you actually need it in result set?
);
And make sure you run vacuum after inserting/updating/deleting rows to update the visibility map.
The downside is the increased index size, of course.
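Not part of the original answer: on PostgreSQL 11 or later, the non-key columns could instead go into an INCLUDE clause, which keeps them out of the B-tree keys while still allowing index-only scans (the index name is assumed):
CREATE INDEX posts_score_covering_index ON posts (
score(vote_up_count, vote_down_count),
date_created
) INCLUDE (id, vote_up_count, vote_down_count);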

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 sec. That's a huge difference for the same result, and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non-COUNT query so much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) query can often be produced without scanning the actual data, just from an index or from table statistics. Redshift stores the minimum and maximum value for each block, which are used to answer exists/not-exists questions like the one described here. If the requested value falls inside a block's min/max boundaries, the scan only has to read the data of the filtering columns. If it falls outside the block boundaries, the answer can be given much faster from the stored statistics alone. With SELECT *, Redshift actually scans the data of all the columns requested by "*", while filtering only on the columns in the WHERE clause.
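A practical takeaway (not stated in the original answer): selecting only the columns you need limits the scan to those columns' blocks, so a query like the following should behave much more like the COUNT(*) case than SELECT * does:
SELECT id, project_id
FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;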