Matching on postal code and city name - very slow in PostgreSQL

I am trying to update address fields in mytable with data from othertable.
When I match on postal code and search for the city name from othertable in mytable, it runs reasonably fast. But since I don't have postal codes in all cases, I also want to match on names only in a second query. That one takes hours (>12 h). Any ideas what I can do to speed up the query? Please note that indexing did not help; index scans in (2) weren't any faster.
Code for matching on postal code + name (1):
update mytable t1 set
  admin1 = t.admin1,
  admin2 = t.admin2,
  admin3 = t.admin3,
  postal_code = t.postal_code,
  lat = t.lat,
  lng = t.lng
from (select * from othertable) t
where t.postal_code = t1.postal_code
  and t1.country = t.country
  and upper(t1.address) like '%' || t.admin1 || '%' -- checks whether the city name from othertable shows up in the address in t1
  and t1.admin1 is null;
Code for matching on name only (2):
update mytable t1 set
  admin1 = t.admin1,
  admin2 = t.admin2,
  admin3 = t.admin3,
  postal_code = t.postal_code,
  lat = t.lat,
  lng = t.lng
from (select * from othertable) t
where t1.country = t.country
  and upper(t1.address) like '%' || t.admin1 || '%' -- checks whether the city name from othertable shows up in the address in t1
  and t1.admin1 is null;
Query plan 1:
"Update on mytable t1 (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" -> Merge Join (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" Merge Cond: (((t1.postal_code)::text = (othertable.postal_code)::text) AND (t1.country = othertable.country))"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.postal_code, t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.postal_code, othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
Query plan 2:
"Update on mytable t1 (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" -> Merge Join (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" Merge Cond: (t1.country = othertable.country)"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"

In the second query you are joining (more or less) on city name only, but othertable has several entries per city name, so a row in mytable can match several source rows and it is unpredictable which value actually gets applied (which lat/lng or admin2/admin3 wins?).
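As a quick sanity check (a sketch using the column names from the query above), you can see how many city names are ambiguous in othertable:
-- City names that map to more than one row in othertable:
SELECT admin1, country, count(*) AS nb_rows
FROM othertable
GROUP BY admin1, country
HAVING count(*) > 1
ORDER BY count(*) DESC;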
If othertable has entries without a postal code, use those by adding an extra condition AND othertable.postal_code IS NULL inside the subquery.
Otherwise, you will want a subset of othertable that returns one row per (admin1, country) value. You would replace select * from othertable with the following query. PostgreSQL has no built-in first() aggregate, so min() is used here; of course you might want to adjust it to pick a different lat/lng/admin2-3 than the lowest one.
SELECT admin1, country, min(postal_code) postal_code, min(lat) lat, min(lng) lng, min(admin2) admin2, min(admin3) admin3
FROM othertable
GROUP BY admin1, country
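An alternative sketch with the same one-row-per-group effect is DISTINCT ON, which lets the ORDER BY decide which row represents each group:
-- One representative row per (admin1, country); the ORDER BY decides which one wins
SELECT DISTINCT ON (admin1, country)
       admin1, country, postal_code, lat, lng, admin2, admin3
FROM othertable
ORDER BY admin1, country, postal_code;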
Worse, the second query would overwrite what was updated by the first query, so you must skip those records by adding AND t1.postal_code IS NULL.
The entire query could be:
UPDATE mytable t1
SET
  admin1 = t.admin1,
  admin2 = t.admin2,
  admin3 = t.admin3,
  postal_code = t.postal_code,
  lat = t.lat,
  lng = t.lng
FROM (
  SELECT admin1, country, min(postal_code) postal_code, min(lat) lat, min(lng) lng, min(admin2) admin2, min(admin3) admin3
  FROM othertable
  GROUP BY admin1, country) t
WHERE t1.country = t.country
  AND upper(t1.address) LIKE '%' || t.admin1 || '%' -- checks whether the city name from othertable shows up in the address in t1
  AND t1.admin1 IS NULL
  AND t1.postal_code IS NULL;
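As a rough sanity check after running both passes (the postal-code match first, then this name-only pass), something like this shows how much is still unmatched:
-- Rows that still have no admin1 after both update passes:
SELECT count(*) FROM mytable WHERE admin1 IS NULL;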

Related

How/When do views generate columns

Consider the following setup:
CREATE TABLE person (
first_name varchar,
last_name varchar,
age INT
);
INSERT INTO person (first_name, last_name, age)
VALUES ('pete', 'peterson', 16),
('john', 'johnson', 20),
('dick', 'dickson', 42),
('rob', 'robson', 30);
CREATE OR REPLACE VIEW adult_view AS
SELECT
first_name || ' ' || last_name as full_name,
age
FROM person;
If I run:
SELECT *
FROM adult_view
WHERE age > 18;
Will the view generate the full_name column for pete even though he gets filtered out?
Similarly, if I run:
SELECT age
FROM adult_view;
Will the view generate any full_name columns at all?
Concerning your first query:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT full_name
FROM adult_view
WHERE age > 18;
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════
Seq Scan on laurenz.person
Output: (((person.first_name)::text || ' '::text) || (person.last_name)::text)
Filter: (person.age > 18)
Query Identifier: 5675263059379476127
So the table is scanned, people aged 18 or under are filtered out, and only then is the result calculated, so full_name won't be computed for Pete.
Concerning your second query:
EXPLAIN (VERBOSE, COSTS OFF)
SELECT age
FROM adult_view
WHERE age > 18;
QUERY PLAN
════════════════════════════════════════
Seq Scan on laurenz.person
Output: person.age
Filter: (person.age > 18)
Query Identifier: -8981994317495194105
(4 rows)
full_name is not calculated at all.
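In other words, the view is flattened into the outer query, so the two statements are planned roughly as if they had been written directly against the base table (a sketch matching the plans shown above):
-- What the planner effectively sees for the first query:
SELECT person.first_name || ' ' || person.last_name AS full_name
FROM person
WHERE person.age > 18;
-- And for the second query, full_name is dropped entirely:
SELECT person.age
FROM person
WHERE person.age > 18;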

Postgres: Optimisation for query "WHERE id IN (...)"

I have a table (2M+ records) which keeps track of a ledger.
Some entries add points, while others subtract points (there are only two kinds of entries). The entries which subtract points always reference the (adding) entries they were subtracted from via referenceentryid. The adding entries always have NULL in referenceentryid.
The table has a dead column which is set to true by a worker when an addition is depleted or expired, or when a subtraction points at a "dead" addition. Since the table has a partial index on dead=false, SELECTs on live rows work pretty fast.
My problem is with the performance of the worker that sets dead to true.
The flow would be:
1. Get an entry for each addition which indicates the amount added, subtracted and whether or not it's expired.
2. Filter away entries which are both not expired and have more addition than subtraction.
3. Update dead=true on every row where either the id or the referenceentryid is in the filtered set of entries.
WITH entries AS
(
SELECT
additions.id AS id,
SUM(subtractions.amount) AS subtraction,
additions.amount AS addition,
additions.expirydate <= now() AS expired
FROM
loyalty_ledger AS subtractions
INNER JOIN
loyalty_ledger AS additions
ON
additions.id = subtractions.referenceentryid
WHERE
subtractions.dead = FALSE
AND subtractions.referenceentryid IS NOT NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_entries AS (
SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE
)
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
In the query above, the inner part runs pretty fast (a few seconds), while the last part runs forever.
I have the following indexes on the table:
CREATE TABLE IF NOT EXISTS loyalty_ledger (
id SERIAL PRIMARY KEY,
programid bigint NOT NULL,
FOREIGN KEY (programid) REFERENCES loyalty_programs(id) ON DELETE CASCADE,
referenceentryid bigint,
FOREIGN KEY (referenceentryid) REFERENCES loyalty_ledger(id) ON DELETE CASCADE,
customerprofileid bigint NOT NULL,
FOREIGN KEY (customerprofileid) REFERENCES customer_profiles(id) ON DELETE CASCADE,
amount int NOT NULL,
expirydate TIMESTAMPTZ,
dead boolean DEFAULT false,
expired boolean DEFAULT false
);
CREATE index loyalty_ledger_referenceentryid_idx ON loyalty_ledger (referenceentryid) WHERE dead = false;
CREATE index loyalty_ledger_customer_program_idx ON loyalty_ledger (customerprofileid, programid) WHERE dead = false;
I'm trying to optimise the last part of the query.
EXPLAIN gives me the following:
"Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger ledger (cost=103412.24..4976040812.22 rows=986583 width=67)"
" Filter: ((SubPlan 3) OR (SubPlan 4))"
" CTE entries"
" -> GroupAggregate (cost=1.47..97737.83 rows=252177 width=25)"
" Group Key: subtractions.referenceentryid, additions.id"
" -> Merge Join (cost=1.47..91390.72 rows=341928 width=28)"
" Merge Cond: (subtractions.referenceentryid = additions.id)"
" -> Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger subtractions (cost=0.43..22392.56 rows=341928 width=12)"
" Index Cond: (referenceentryid IS NOT NULL)"
" -> Index Scan using loyalty_ledger_pkey on loyalty_ledger additions (cost=0.43..80251.72 rows=1683086 width=16)"
" CTE dead_entries"
" -> CTE Scan on entries (cost=0.00..5673.98 rows=168118 width=4)"
" Filter: ((subtraction >= addition) OR expired)"
" SubPlan 3"
" -> CTE Scan on dead_entries (cost=0.00..3362.36 rows=168118 width=4)"
" SubPlan 4"
" -> CTE Scan on dead_entries dead_entries_1 (cost=0.00..3362.36 rows=168118 width=4)"
Seems like the last part of my query is very inefficient. Any ideas on how to speed it up?
For large datasets, I have found semi-joins (EXISTS) to have much better performance than IN-lists:
select *
from
loyalty_ledger as ledger
WHERE
ledger.dead = FALSE AND (
exists (
select null
from dead_entries d
where d.id = ledger.id
) or
exists (
select null
from dead_entries d
where d.id = ledger.referenceentryid
)
)
I honestly don't know, but I think each of the following would also be worth a try. They are less code and more intuitive, but there is no guarantee they will perform better:
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id = ledger.id or d.id = ledger.referenceentryid
)
or
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id in (ledger.id, ledger.referenceentryid)
)
What helped me in the end was to do the id filtering inside a second WITH step, joining on the ids there instead of using IN in the outer query:
WITH entries AS
(
SELECT
additions.id AS id,
additions.amount - coalesce(SUM(subtractions.amount),0) AS balance,
additions.expirydate <= now() AS passed_expiration
FROM
loyalty_ledger AS additions
LEFT JOIN
loyalty_ledger AS subtractions
ON
subtractions.dead = FALSE AND
additions.id = subtractions.referenceentryid
WHERE
additions.dead = FALSE AND additions.referenceentryid IS NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_rows AS (
SELECT
l.id AS id,
-- only additions that still have usable points can expire
l.referenceentryid IS NULL AND e.balance > 0 AND e.passed_expiration AS expired
FROM
loyalty_ledger AS l
INNER JOIN
entries AS e
ON
(l.id = e.id OR l.referenceentryid = e.id)
WHERE
l.dead = FALSE AND
(e.balance <= 0 OR e.passed_expiration)
ORDER BY e.balance DESC
)
UPDATE
loyalty_ledger AS l
SET
(dead, expired) = (TRUE, d.expired)
FROM
dead_rows AS d
WHERE
l.id = d.id AND
l.dead = FALSE;
I also believe
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
can be rewritten into a JOIN and UNION ALL, which will most likely also generate a different execution plan and might be faster.
It is hard to verify without the other table structures, though.
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
And because CTEs in PostgreSQL are materialized and not indexed (in PostgreSQL 12+ this can be overridden with NOT MATERIALIZED), you are most likely better off removing the dead_entries CTE and repeating its definition outside the CTE:
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE

How to index group by date_trunc('second',date)

How can I index GROUP BY date_trunc('second', date)? The cost of date_trunc('second', date) is very high even though I added the following index.
select c.name, t.code, count(*) count_second, sum(volume) sum_volume
from tradedetail t
inner join company c on c.code = t.code and country = 'MY'
where date::date = current_date and t.buysell = 'S'
group by t.code, c.name, date_trunc('second', date)
having sum(volume) >= 4000
CREATE INDEX tradedetail_sbuysell
ON tradedetail
USING btree
(code COLLATE pg_catalog."default", date_trunc('second',date), buysell COLLATE pg_catalog."default")
where buysell = 'S';
"GroupAggregate (cost=4094738.36..4108898.28 rows=435690 width=30)"
" Filter: (sum(t.volume) >= 4000::numeric)"
" -> Sort (cost=4094738.36..4095827.58 rows=435690 width=30)"
" Sort Key: t.code, c.name, (date_trunc('second'::text, t.date))"
" -> Nested Loop (cost=0.01..4033076.58 rows=435690 width=30)"
" -> Seq Scan on company c (cost=0.00..166.06 rows=4735 width=14)"
" Filter: ((country)::text = 'MY'::text)"
" -> Index Scan using tradedetail_index5 on tradedetail t (cost=0.01..849.77 rows=172 width=22)"
" Index Cond: (((code)::text = (c.code)::text) AND ((date)::date = ('now'::cstring)::date) AND ((buysell)::text = 'S'::text))"

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please explain:
What is a "merge-joinable or hash-joinable join condition", and why does joining on a.value IS NOT DISTINCT FROM b.value not fulfill this condition, while a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, an EXPLAIN shows that a NULL lookup is just as efficient as looking up non-NULL values:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
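A quick way to see which operators qualify is to check the oprcanmerge and oprcanhash flags in the pg_operator catalog (a sketch):
-- text equality is both merge- and hash-joinable, which is why a.value = b.value is fine:
SELECT oprname, oprcanmerge, oprcanhash
FROM pg_operator
WHERE oprname = '='
  AND oprleft = 'text'::regtype
  AND oprright = 'text'::regtype;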
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though.

postgresql ignores index for recursive query

I have a table representing a graph of hierarchy links (parent_id, child_id).
The table has indexes on parent, child, and both.
The graph may contain loops, and I need to check for them (or maybe I need to find all loops in order to eliminate them).
I also need to query all parents of a node recursively.
For this I use the following query (it is supposed to be saved in a view):
WITH RECURSIVE recursion(parent_id, child_id, node_id, path) AS (
SELECT h.parent_id,
h.child_id,
h.child_id AS node_id,
ARRAY[h.parent_id, h.child_id] AS path
FROM hierarchy h
UNION ALL
SELECT h.parent_id,
h.child_id,
r.node_id,
ARRAY[h.parent_id] || r.path
FROM hierarchy h JOIN recursion r ON h.child_id = r.parent_id
WHERE NOT r.path #> ARRAY[h.parent_id]
)
SELECT parent_id,
child_id,
node_id,
path
FROM recursion
where node_id = 883
For this query Postgres chooses a terrible plan:
"CTE Scan on recursion (cost=2703799682.88..4162807558.70 rows=324223972 width=56)"
" Filter: (node_id = 883)"
" CTE recursion"
" -> Recursive Union (cost=0.00..2703799682.88 rows=64844794481 width=56)"
" -> Seq Scan on hierarchy h (cost=0.00..74728.61 rows=4210061 width=56)"
" -> Merge Join (cost=10058756.99..140682906.47 rows=6484058442 width=56)"
" Merge Cond: (h_1.child_id = r.parent_id)"
" Join Filter: (NOT (r.path #> ARRAY[h_1.parent_id]))"
" -> Index Scan using hierarchy_idx_child on hierarchy h_1 (cost=0.43..256998.25 rows=4210061 width=16)"
" -> Materialize (cost=10058756.56..10269259.61 rows=42100610 width=48)"
" -> Sort (cost=10058756.56..10164008.08 rows=42100610 width=48)"
" Sort Key: r.parent_id"
" -> WorkTable Scan on recursion r (cost=0.00..842012.20 rows=42100610 width=48)"
It seems like Postgres does not understand that the external filter on node_id applies to child_id in the first part of the recursion.
I suppose I'm doing something very wrong here, but what exactly?
Looks like you just need to move WHERE node_id = 883 into the first part of the union (node_id is only an output alias there, so the condition is written against h.child_id):
WITH RECURSIVE recursion(parent_id, child_id, node_id, path) AS (
SELECT h.parent_id,
h.child_id,
h.child_id AS node_id,
ARRAY[h.parent_id, h.child_id] AS path
FROM hierarchy h
WHERE h.child_id = 883
UNION ALL
SELECT h.parent_id,
h.child_id,
r.node_id,
ARRAY[h.parent_id] || r.path
FROM hierarchy h JOIN recursion r ON h.child_id = r.parent_id
WHERE NOT r.path #> ARRAY[h.parent_id]
)
SELECT parent_id,
child_id,
node_id,
path
FROM recursion
Here is a much more effective way to solve graph-traversal tasks like this.
CREATE OR REPLACE FUNCTION public.terr_ancestors(IN bigint)
RETURNS TABLE(node_id bigint, depth integer, path bigint[]) AS
$BODY$
WITH RECURSIVE recursion(node_id, depth, path) AS (
  SELECT $1 AS node_id, 0, ARRAY[$1] AS path
  UNION ALL
  SELECT h.parent_id AS node_id, r.depth + 1, h.parent_id || r.path
  FROM hierarchy h JOIN recursion r ON h.child_id = r.node_id
  WHERE h.parent_id <> ALL(r.path) -- != ANY would be true for almost every row; ALL is needed to stop at loops
)
SELECT * FROM recursion
$BODY$ LANGUAGE sql;
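An example call (a sketch, using the node id from the question):
SELECT * FROM terr_ancestors(883);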
And a similar function works for descendants.
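A sketch of that descendants variant, assuming the same hierarchy table (the function name terr_descendants is hypothetical; only the direction of the walk changes):
CREATE OR REPLACE FUNCTION public.terr_descendants(IN bigint)
RETURNS TABLE(node_id bigint, depth integer, path bigint[]) AS
$BODY$
WITH RECURSIVE recursion(node_id, depth, path) AS (
  SELECT $1 AS node_id, 0, ARRAY[$1] AS path
  UNION ALL
  -- walk the edges in the opposite direction: from parent to child
  SELECT h.child_id AS node_id, r.depth + 1, h.child_id || r.path
  FROM hierarchy h JOIN recursion r ON h.parent_id = r.node_id
  WHERE h.child_id <> ALL(r.path)
)
SELECT * FROM recursion
$BODY$ LANGUAGE sql;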