I have a table that is getting too big and I want to reduce its size
with an UPDATE query. Some of the data in this table is redundant, and
I should be able to reclaim a lot of space by setting the redundant
"cells" to NULL. However, my UPDATE queries are taking excessive
amounts of time to complete.
Table details
-- table1 10M rows (estimated)
-- 45 columns
-- Table size 2200 MB
-- Toast Table size 17 GB
-- Indexes Size 1500 MB
-- **columns in query**
-- id integer primary key
-- testid integer foreign key
-- band integer
-- date timestamptz indexed
-- data1 real[]
-- data2 real[]
-- data3 real[]
This was my first attempt at an update query. I broke it up into some
temporary tables just to get the ids to update. Further, to reduce the
scope of the query, I restricted it to a date range covering June 2020:
CREATE TEMP TABLE A as
SELECT testid
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3;
CREATE TEMP TABLE B as -- this table has 180k rows
SELECT id
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND testid in (SELECT testid FROM A)
AND band > 1;
UPDATE table1
SET data1 = Null, data2 = Null, data3 = Null
WHERE id in (SELECT id FROM B);
Queries for creating the TEMP tables execute in under 1 sec. I ran the UPDATE query for an hour(!) before I finally killed it. Only 180k
rows needed to be updated. It doesn't seem like it should take that much
time to update so few rows. Temp table B identifies exactly which
rows to update.
Here is the EXPLAIN for the above UPDATE query. One odd feature of this plan is that it shows 4.88M rows, even though there are only 180k rows to update.
Update on table1 (cost=3212.43..4829.11 rows=4881014 width=309)
-> Nested Loop (cost=3212.43..4829.11 rows=4881014 width=309)
-> HashAggregate (cost=3212.00..3214.00 rows=200 width=10)
-> Seq Scan on b (cost=0.00..2730.20 rows=192720 width=10)
-> Index Scan using table1_pkey on table1 (cost=0.43..8.07 rows=1 width=303)
Index Cond: (id = b.id)
Another way to run this query is in one shot:
WITH t as (
SELECT id from table1
WHERE testid in (
SELECT testid
from table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3
)
)
UPDATE table1 a
SET data1 = Null, data2 = Null, data3 = Null
FROM t
WHERE a.id = t.id;
I only ran this one for about 10 minutes before I killed it. It feels like I should be able to run this query in much less time if I just knew the tricks. Its EXPLAIN is below. This plan shows 195k rows, which is more in line with expectations, but the cost is much higher: 1.3M to 1.7M.
Update on testlog a (cost=1337986.60..1740312.98 rows=195364 width=331)
CTE t
-> Hash Join (cost=8834.60..435297.00 rows=195364 width=4)
Hash Cond: (testlog.testid = testlog_1.testid)
-> Seq Scan on testlog (cost=0.00..389801.27 rows=9762027 width=8)
-> Hash (cost=8832.62..8832.62 rows=158 width=4)
-> HashAggregate (cost=8831.04..8832.62 rows=158 width=4)
-> Index Scan using amptest_testlog_date_idx on testlog testlog_1 (cost=0.43..8820.18 rows=4346 width=4)
Index Cond: ((date >= '2020-06-01 00:00:00-07'::timestamp with time zone) AND (date <= '2020-07-01 00:00:00-07'::timestamp with time zone))
Filter: (band = 3)
-> Hash Join (cost=902689.61..1305015.99 rows=195364 width=331)
Hash Cond: (t.id = a.id)
-> CTE Scan on t (cost=0.00..3907.28 rows=195364 width=32)
-> Hash (cost=389801.27..389801.27 rows=9762027 width=303)
-> Seq Scan on testlog a (cost=0.00..389801.27 rows=9762027 width=303)
Edit: one of the suggestions in the accepted answer was to drop any indexes before the update and then add them back later. This is what I went with, with a twist: I needed another table to hold indexed data from the dropped indexes to make the A and B queries faster:
CREATE TABLE tempid AS
SELECT id, testid, band, date
FROM table1;
I made indexes on this table for id, testid, and date. Then I replaced table1 in the A and B queries with tempid. It still went slower than I would have liked, but it did get the job done.
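For reference, the indexes on tempid were along these lines (auto-generated index names; a sketch rather than the exact statements I ran):
CREATE INDEX ON tempid (id);
CREATE INDEX ON tempid (testid);
CREATE INDEX ON tempid (date);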
You might have another table with a foreign key referencing one or more of the columns you are setting to NULL, and that referencing table might not have an index on the column.
Each time you set a value to NULL, the database has to check the referencing table - maybe it has a row that references the value you are removing.
If this is the case, you should be able to speed it up by adding an index on that referencing table.
For example if you have a table like this:
create table table2 (
id serial primary key,
band integer references table1(data1)
);
Then you can create an index: create index table2_band_nnull_idx on table2 (band) where band is not null;
But you indicated that all the columns you are setting to NULL have an array type, which makes it unlikely that they are referenced. Still, it is worth checking.
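One quick way to check is to ask the system catalogs which foreign keys point at table1; a sketch (pg_constraint lists all constraints, contype = 'f' marks foreign keys):
SELECT conname,
       conrelid::regclass AS referencing_table,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'f'
AND confrelid = 'table1'::regclass;
If this returns no rows, nothing references table1 and foreign-key checks are not the problem.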
Another possibility is that you have a trigger on the table that runs slowly.
Another possibility is that you have a lot of indexes on the table. Each index has to be updated for each row you update, and that work uses only a single processor core.
Sometimes it is faster to drop all the indexes, do the bulk update and then recreate them all. Creating indexes can use multiple cores - one core per index.
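A minimal sketch of that pattern, assuming a hypothetical secondary index name (keep the primary key index, since the UPDATE looks rows up by id):
DROP INDEX IF EXISTS table1_date_idx;  -- hypothetical name; repeat for other secondary indexes

UPDATE table1
SET data1 = NULL, data2 = NULL, data3 = NULL
WHERE id IN (SELECT id FROM B);

CREATE INDEX table1_date_idx ON table1 (date);  -- recreate afterwards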
Another possibility is that your query is waiting for some other query to finish and release its locks. You should check with:
select now()-query_start, * from pg_stat_activity where state<>'idle' order by 1;
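On 9.6 or later you can also ask directly which sessions are blocking a given backend; a quick diagnostic sketch:
select pid, state, now()-query_start as runtime,
       pg_blocking_pids(pid) as blocked_by, query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;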
If I have 6 columns (including the primary key id) and I need to query (WHERE =) any combination of the columns, should I index them all individually? I think I should.
I index them individually rather than using multicolumn indexes, because I will also query one column at a time (any combination of columns, including a single column and all columns at once).
First of all, there is an EXPLAIN statement that should answer your question.
janbet=> create table aa (n1 integer, n2 integer);
CREATE TABLE
janbet=> create index on aa (n1);
CREATE INDEX
janbet=> create index on aa (n2);
CREATE INDEX
janbet=> insert into aa select (random() * 10^5)::integer, (random() * 10^4)::integer from generate_series(1, (10^4)::integer, 1);
INSERT 0 10000
janbet=> explain select from aa where n1 = 7;
QUERY PLAN
--------------------------------------------------------------------------
Index Only Scan using aa_n1_idx on aa (cost=0.40..32.41 rows=1 width=0)
Index Cond: (n1 = 7)
(2 rows)
janbet=> explain select from aa where n2 = 8;
QUERY PLAN
--------------------------------------------------------------------------
Index Only Scan using aa_n2_idx on aa (cost=0.40..32.41 rows=1 width=0)
Index Cond: (n2 = 8)
(2 rows)
janbet=> explain select from aa where n1 = 7 and n2 = 8;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using aa_n2_idx on aa (cost=0.40..32.42 rows=1 width=0)
Index Cond: (n2 = 8)
Filter: (n1 = 7)
(3 rows)
As you can see in the last query above, one index is used and the other column's value is filtered.
And the same will happen in your case - in a simple query like this, PostgreSQL uses at most one index per table. This is just how the indexes work - once the planner has used the information from one index, there is no cheap way of applying the information from the other.
On the other hand, PostgreSQL works well with multicolumn indexes when only the first few columns are used. E.g. if you CREATE INDEX ON some_table (a, b, c), then WHERE a=7, WHERE a=7 AND b=8, and WHERE a=7 AND b=8 AND c=9 will use the index, but e.g. WHERE b=7 will not.
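To see this for yourself you can repeat the experiment with a three-column index; a sketch (on a very small table the planner may still prefer a sequential scan, so treat the plans as illustrative):
create table some_table (a integer, b integer, c integer);
create index on some_table (a, b, c);
insert into some_table select n % 100, n % 10, n from generate_series(1, 10000) n;
analyze some_table;
explain select from some_table where a = 7;                     -- can use the index
explain select from some_table where a = 7 and b = 8;           -- can use the index
explain select from some_table where a = 7 and b = 8 and c = 9; -- can use the index
explain select from some_table where b = 7;                     -- leading column missing, index of little use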
Long story short: there is no way to ensure that an index will be used for every combination of six variables other than creating 720 indexes (one for each column order), but that is in no way a recommended solution : )
Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition" and why joining on a.value IS NOT DISTINCT FROM b.value doesn't fulfill this condition, but a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, EXPLAIN shows that a NULL lookup is just as efficient as looking up non-NULL values:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
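You can see the difference by comparing the two join conditions directly; a quick check (plan output elided):
EXPLAIN SELECT * FROM a FULL JOIN b ON a.value = b.value;
-- plans fine, as a Hash Full Join or Merge Full Join
EXPLAIN SELECT * FROM a FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
-- ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions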
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though.
Why:
A colleague and I were discussing composite keys, and the fact that MySQL requires the indexed columns to appear sequentially, without gaps, in the WHERE clause for the planner to use the index. I wanted to show how Postgres can use the second column of a composite key for a scan. And I failed! It was using the first column, but not the second! Totally confused, I experimented a bit and found that it starts using the second column when the index is 6.5 times smaller than the table:
populate:
create table so2 (a int not null,b int not null, c text, d int not null);
with l as (select generate_series(999,999+76,1) r)
insert into so2
select l.r,l.r+1,concat('l',lpad('o',l.r,'o'),'ng'),1 from l;
alter table so2 ADD CONSTRAINT so2pk PRIMARY KEY (a,b);
analyze so2;
the plan that confused me:
t=# explain analyze select 42 from so2 where a=1004;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Index Only Scan using so2pk on so2 (cost=0.14..8.16 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (a = 1004)
Heap Fetches: 1
Planning time: 0.090 ms
Execution time: 0.026 ms
(5 rows)
t=# explain analyze select 42 from so2 where b=1004;
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on so2 (cost=0.00..11.96 rows=1 width=0) (actual time=0.006..0.028 rows=1 loops=1)
Filter: (b = 1004)
Rows Removed by Filter: 76
Planning time: 0.045 ms
Execution time: 0.036 ms
(5 rows)
Then I drop so2 and rerun the prepare part with 999+77 instead of 999+76, and the plan for column b changes:
t=# explain analyze select 42 from so2 where b=1004;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Index Only Scan using so2pk on so2 (cost=0.14..12.74 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=1)
Index Cond: (b = 1004)
Heap Fetches: 1
Planning time: 0.038 ms
Execution time: 0.013 ms
(5 rows)
The only difference I noticed is the number of pages the relation takes:
size for the confusing plan:
t=# \dt+ so2
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+--------+-------------
public | so2 | table | vao | 120 kB |
(1 row)
size for the expected plan:
t=# \dt+ so2
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+--------+-------------
public | so2 | table | vao | 128 kB |
(1 row)
The index is the same size in both cases:
t=# \di+ so2pk
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------+-------+-------+-------+-------+-------------
public | so2pk | index | vao | so2 | 16 kB |
(1 row)
Settings that could affect the plan are all at their defaults:
select name,setting
from pg_settings
where source != 'default' and name in (
'enable_bitmapscan',
'enable_hashagg',
'enable_hashjoin',
'enable_indexscan',
'enable_indexonlyscan',
'enable_material',
'enable_mergejoin',
'enable_nestloop',
'enable_seqscan',
'enable_sort',
'enable_tidscan',
'seq_page_cost',
'random_page_cost',
'cpu_tuple_cost',
'cpu_index_tuple_cost',
'cpu_operator_cost',
'effective_cache_size',
'geqo',
'geqo_threshold',
'geqo_effort',
'geqo_pool_size',
'geqo_generations',
'geqo_selection_bias',
'geqo_seed',
'join_collapse_limit',
'from_collapse_limit',
'cursor_tuple_fraction',
'constraint_exclusion',
'default_statistics_target'
) order by name
;
name | setting
------+---------
(0 rows)
Tried in several versions (9.3.10, 9.5.4) with the same behaviour.
Now - excuse me for such a long post! And the questions:
16 kB is smaller than 120 kB - why would the planner choose a Seq Scan?..
UPDATED to reflect e4c5's remarks
Also:
For a second I thought it could be because the text column is kept in extended storage, so that the table itself (all columns but the text one) takes about the same number of pages as the index, so I altered the column's storage to main and to plain - no effect...
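For reference, the storage change I tried was of this form (a sketch; note that SET STORAGE only affects newly stored values, so the data would have to be rewritten for it to make a difference):
ALTER TABLE so2 ALTER COLUMN c SET STORAGE MAIN;
-- and, in a separate attempt:
ALTER TABLE so2 ALTER COLUMN c SET STORAGE PLAIN;
ANALYZE so2;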
I think the key difference here is the fact that MySQL can only use one index per table, whereas PostgreSQL does not have that limitation. More than one index can be used, which is perhaps why the documentation says:
Multicolumn indexes should be used sparingly. In most situations, an
index on a single column is sufficient and saves space and time.
Indexes with more than three columns are unlikely to be helpful unless
the usage of the table is extremely stylized.
A few other pointers:
1) You have far too little data to reach any conclusion. Yes, the query plan does depend a great deal on size - the number of rows in the table. The first stored function creates only 62 rows, and for that you don't need an index.
2) The value that you are searching for (a=4) is not in the table.
3) The value for b is always one, so an index on that column would be useless. Even a composite index wouldn't give it very high cardinality; i.e. a composite index on (a,b) is exactly the same as an index on a.
Update
There is no line-in-the-sand threshold, in number of rows or size in kB, at which an index becomes effective. The query planner decides whether or not to use an index, and which index to use, based on several configuration factors described at https://www.postgresql.org/docs/9.2/static/runtime-config-query.html
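When experimenting, you can also nudge the planner per session to see whether it is able to use the index at all and what cost it assigns to that path, e.g.:
set enable_seqscan = off;  -- discourage seq scans for this session only
explain analyze select 42 from so2 where b=1004;
reset enable_seqscan;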
I'm stuck with a (Postgres 9.4.6) query that a) is using too much memory (most likely due to array_agg()) and b) does not return exactly what I need, making post-processing necessary. Any input (especially regarding memory consumption) is highly appreciated.
Explanation:
The table token_groups holds all words used in tweets I've parsed, with their respective occurrence frequency in an hstore, with one row per 10 minutes (for the last 7 days, so 7*24*6 rows in total). These rows are inserted in order of tweeted_at, so I can simply order by id. I'm using row_number() to identify when a word occurred.
# \d token_groups
Table "public.token_groups"
Column | Type | Modifiers
------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('token_groups_id_seq'::regclass)
tweeted_at | timestamp without time zone | not null
tokens | hstore | not null default ''::hstore
Indexes:
"token_groups_pkey" PRIMARY KEY, btree (id)
"index_token_groups_on_tweeted_at" btree (tweeted_at)
What I'd ideally want is a list of words, each with the relative distances of their row numbers. So if e.g. the word 'hello' appears once in row 5, twice in row 8 and once in row 20, I'd want a column with the word and an array column returning {5,3,0,12} (meaning: first occurrence in the fifth row, next occurrence 3 rows later, next occurrence 0 rows later, next one 12 rows later). If anyone wonders why: 'relevant' words occur in clusters, so (simplified) the higher the standard deviation of the temporal distances, the more likely a word is a keyword. See more here: http://bioinfo2.ugr.es/Publicaciones/PRE09.pdf
For now, I return an array with positions and an array with frequencies, and use this info to calculate the distances in ruby.
Currently the primary problem is a high memory spike, which seems to be caused by array_agg(). The (very helpful) Heroku staff tell me that some of my connections use 500-700 MB with very little shared memory, causing out-of-memory errors (I'm running Standard-0, which gives me 1 GB total for all connections), so I need to find an optimization.
The total number of hstore entries is ~100k, which is then aggregated (after skipping words with very low frequency):
SELECT COUNT(*)
FROM (SELECT row_number() over(ORDER BY id ASC) AS position,
(each(tokens)).key, (each(tokens)).value::integer
FROM token_groups) subquery;
count
--------
106632
Here is the query causing the memory load:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM (
SELECT row_number() over(ORDER BY id ASC) AS pos,
(each(tokens)).key,
(each(tokens)).value::integer
FROM token_groups
) subquery
GROUP BY key
HAVING SUM(value) > 10;
The output is:
key | positions | frequencies
-------------+---------------------------------------------------------+-------------------------------
hello | {172,185,188,210,349,427,434,467,479} | {1,2,1,1,2,1,2,1,4}
world | {166,218,265,343,415,431,436,493} | {1,1,2,1,2,1,2,1}
some | {35,65,101,180,193,198,223,227,420,424,427,428,439,444} | {1,1,1,1,1,1,1,2,1,1,1,1,1,1}
other | {77,111,233,416,421,494} | {1,1,4,1,2,2}
word | {170,179,182,184,185,186,187,188,189,190,196} | {3,1,1,2,1,1,1,2,5,3,1}
(...)
Here's what explain says:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12789.00..12792.50 rows=200 width=44) (actual time=309.692..343.064 rows=2341 loops=1)
Output: ((each(token_groups.tokens)).key), array_agg((row_number() OVER (?))), array_agg((((each(token_groups.tokens)).value)::integer))
Group Key: (each(token_groups.tokens)).key
Filter: (sum((((each(token_groups.tokens)).value)::integer)) > 10)
Rows Removed by Filter: 33986
Buffers: shared hit=2176
-> WindowAgg (cost=177.66..2709.00 rows=504000 width=384) (actual time=0.947..108.157 rows=106632 loops=1)
Output: row_number() OVER (?), (each(token_groups.tokens)).key, ((each(token_groups.tokens)).value)::integer, token_groups.id
Buffers: shared hit=2176
-> Sort (cost=177.66..178.92 rows=504 width=384) (actual time=0.910..1.119 rows=504 loops=1)
Output: token_groups.id, token_groups.tokens
Sort Key: token_groups.id
Sort Method: quicksort Memory: 305kB
Buffers: shared hit=150
-> Seq Scan on public.token_groups (cost=0.00..155.04 rows=504 width=384) (actual time=0.013..0.505 rows=504 loops=1)
Output: token_groups.id, token_groups.tokens
Buffers: shared hit=150
Planning time: 0.229 ms
Execution time: 570.534 ms
PS: if anyone wonders: every 10 minutes I append new data to the token_groups table and remove outdated data, which is easy when storing the data one row per 10 minutes. I still have to come up with a better data structure that e.g. uses one row per word, but that does not seem to be the main issue; I think it's the array aggregation.
Your presented query can be simpler, evaluating each() only once per row:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM (
SELECT t.key, pos, t.value::int
FROM (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) tg
, each(tg.tokens) t -- implicit LATERAL join
ORDER BY t.key, pos
) sub
GROUP BY key
HAVING sum(value) > 10;
Also preserving correct order of elements.
What I'd ideally want is a list of words with each the relative distances of their row numbers.
This would do it:
SELECT key, array_agg(step) AS occurrences
FROM (
SELECT key, CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END AS step
FROM (
SELECT key, value::int, pos
, lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos) AS last_pos
FROM (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) tg
, each(tg.tokens) t
) t1
, generate_series(1, t1.value) g
ORDER BY key, pos, g
) sub
GROUP BY key
HAVING count(*) > 10;
SQL Fiddle.
Interpreting each hstore key as a word and the respective value as number of occurrences in the row (= for the last 10 minutes), I use two cascading LATERAL joins: 1st step to decompose the hstore value, 2nd step to multiply rows according to value. (If your value (frequency) is mostly just 1, you can simplify.) About LATERAL:
What is the difference between LATERAL and a subquery in PostgreSQL?
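In isolation, the decomposition step is just this (a minimal sketch against token_groups):
SELECT g.id, t.key, t.value::int AS freq
FROM   token_groups g, each(g.tokens) t;  -- implicit LATERAL: one row per word per 10-minute group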
Then I ORDER BY key, pos, g in the subquery before aggregating in the outer SELECT. This clause seems to be redundant, and in fact, I see the same result without it in my tests. That's a collateral benefit from the window definition of lag() in the inner query, which is carried over to the next step unless any other step triggers re-ordering. However, now we depend on an implementation detail that's not guaranteed to work.
Ordering the whole query once should be substantially faster (and easier on the required sort memory) than per-aggregate sorting. This is not strictly according to the SQL standard either, but the simple case is documented for Postgres:
Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not portable to other database systems.
Strictly speaking, we only need:
ORDER BY pos, g
You could experiment with that. Related:
PostgreSQL unnest() with element number
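For comparison, the standard per-aggregate ordering (an ORDER BY inside each aggregate call) applied to the first query would look like this; a sketch, and typically more expensive than ordering the subquery once when there are many groups:
SELECT key
     , array_agg(pos ORDER BY pos)   AS positions
     , array_agg(value ORDER BY pos) AS frequencies
FROM  (
   SELECT t.key, tg.pos, t.value::int AS value
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) tg
       , each(tg.tokens) t   -- no ORDER BY in the subquery this time
   ) sub
GROUP  BY key
HAVING sum(value) > 10;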
Possible alternative:
SELECT key
, ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM (
SELECT key, pos, value::int
,(pos - lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos))::text AS step
FROM (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
, each(g.tokens) t
ORDER BY key, pos
) t1
GROUP BY key;
-- HAVING sum(value) > 10;
Might be cheaper to use text concatenation instead of generate_series().