How to index group by date_trunc('second',date) - postgresql

How can I index group by date_trunc('second', date)? The cost for date_trunc('second', date) is very high even though I added the index below.
select c.name, t.code, count(*) count_second, sum(volume) sum_volume
from tradedetail t
inner join company c on c.code = t.code and country = 'MY'
where date::date = current_date and t.buysell = 'S'
group by t.code, c.name, date_trunc('second', date)
having sum(volume) >= 4000;
CREATE INDEX tradedetail_sbuysell
ON tradedetail
USING btree
(code COLLATE pg_catalog."default", date_trunc('second',date), buysell COLLATE pg_catalog."default")
where buysell = 'S';
"GroupAggregate (cost=4094738.36..4108898.28 rows=435690 width=30)"
" Filter: (sum(t.volume) >= 4000::numeric)"
" -> Sort (cost=4094738.36..4095827.58 rows=435690 width=30)"
" Sort Key: t.code, c.name, (date_trunc('second'::text, t.date))"
" -> Nested Loop (cost=0.01..4033076.58 rows=435690 width=30)"
" -> Seq Scan on company c (cost=0.00..166.06 rows=4735 width=14)"
" Filter: ((country)::text = 'MY'::text)"
" -> Index Scan using tradedetail_index5 on tradedetail t (cost=0.01..849.77 rows=172 width=22)"
" Index Cond: (((code)::text = (c.code)::text) AND ((date)::date = ('now'::cstring)::date) AND ((buysell)::text = 'S'::text))"

Related

Attempt to avoid duplicated code with CTE takes far too long

I need to create a view with some calculated and aggregated values. So I need certain values multiple times, like total_dist_pts in the example below.
There is a loc_a_run table with about 350 rows so far (constantly growing) and a loc_a_pline table with somewhat more than 4 million rows (also growing):
prod=# \d loc_a_run
                              Table "public.loc_a_run"
     Column     |           Type           | Collation | Nullable |           Default
----------------+--------------------------+-----------+----------+-----------------------------
 loc_a_run_id   | integer                  |           | not null | generated always as identity
 execution_time | timestamp with time zone |           | not null | CURRENT_TIMESTAMP(0)
 run_type_id    | smallint                 |           | not null |
...
 has_errors     | boolean                  |           |          |
Indexes:
    "loc_a_run_pkey" PRIMARY KEY, btree (loc_a_run_id)
prod=# \d loc_a_pline
                        Table "public.loc_a_pline"
     Column     |   Type    | Collation | Nullable |           Default
----------------+-----------+-----------+----------+-----------------------------
 loc_a_pline_id | bigint    |           | not null | generated always as identity
 loc_a_run_id   | integer   |           | not null |
 is_top         | boolean   |           | not null |
...
 dist_left      | numeric   |           | not null |
 dist_right     | numeric   |           | not null |
...
Indexes:
    "loc_a_pline_pkey" PRIMARY KEY, btree (loc_a_pline_id)
Foreign-key constraints:
    "loc_a_pline_loc_a_run_id_fkey" FOREIGN KEY (loc_a_run_id) REFERENCES loc_a_run(loc_a_run_id) ON UPDATE CASCADE ON DELETE CASCADE
The solution I use right now:
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
/ COUNT(pline.loc_a_pline_id)
AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=154201.17..154577.71 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=154201.17..154519.69 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=153201.15..153204.56 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=153113.01..153130.07 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Hash Join (cost=21.67..120633.75 rows=1623963 width=62)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112756.41 rows=1623963 width=32)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107766.55 rows=1867855 width=30)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes around 14.2s to execute. I lack the experience to assess how good or bad the performance is for this part, but I could live with it. Of course, faster would be an advantage.
Because this contains duplicated code, I tried to get rid of it by using a CTE (in the final view I need this for a few more calculations, but the pattern is the same):
WITH dist_pts AS (
SELECT loc_a_run_id
, CASE
WHEN is_top IS true
THEN ROUND(dist_right - dist_left, 2)
ELSE ROUND(dist_left - dist_right, 2)
END AS pts
FROM loc_a_pline
)
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(dist_pts.pts) AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(dist_pts.pts) / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=575677889.59..575678266.13 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=575677889.59..575678208.12 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=575676889.57..575676892.98 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=575676801.43..575676818.49 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Parallel Hash Join (cost=155366.81..111024852.15 rows=23232597464 width=62)"
" Hash Cond: (loc_a_pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline (cost=0.00..107877.85 rows=1869785 width=22)"
" -> Parallel Hash (cost=120758.30..120758.30 rows=1625641 width=48)"
" -> Hash Join (cost=21.67..120758.30 rows=1625641 width=48)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112872.83 rows=1625641 width=18)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107877.85 rows=1869785 width=12)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes forever and seems to be the wrong approach. I struggle to understand the query plan to find my mistake(s).
So my questions are:
Why does the CTE approach take so much time?
What would be the smartest solution to avoid duplicated code and eventually reduce execution time?
Is there a way to SUM(dist_pts.pts) only once?
Is there a way to COUNT(pline.loc_a_pline_id) in the same go as the subtraction in the CTE instead of accessing the big loc_a_pline table again? (is it accessed again at all?)
Any help is highly appreciated
Consider creating an index on loc_a_pline.loc_a_run_id. Postgres doesn't automatically create indexes on the referencing side of FK relationships. That should drastically improve the speed and remove the sequential scans over loc_a_pline from the execution plan.
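A minimal sketch of that index (the name is illustrative):
CREATE INDEX loc_a_pline_loc_a_run_id_idx
    ON loc_a_pline (loc_a_run_id);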
As for why the CTE version explodes: it joins loc_a_pline twice on loc_a_run_id (once through dist_pts and once directly as pline), which pairs every pline of a run with every other pline of the same run; that is the 23-billion-row estimate in the Parallel Hash Join. Additionally, I'd suggest gathering all the data you want in the first portion of the CTE and then separating the aggregates out into their own portion. Something like this, which accesses each table once and aggregates over the set once:
WITH dist_pts AS (
SELECT run.loc_a_run_id rid
, pline.loc_a_pline_id pid
, rt.run_type
, CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END AS pts
FROM loc_a_run run
JOIN loc_a_pline pline ON run.loc_a_run_id = pline.loc_a_run_id
JOIN run_type rt ON run.run_type_id = rt.run_type_id
WHERE run.has_errors IS FALSE
), aggs AS
(
SELECT SUM(dist_pts.pts)::NUMERIC total_dist_pts
, COUNT(dist_pts.pid)::NUMERIC total_plines
, dist_pts.rid
FROM dist_pts
GROUP BY dist_pts.rid
)
SELECT dp.rid
, dp.run_type
, aggs.total_dist_pts
, aggs.total_plines
, (aggs.total_dist_pts / aggs.total_plines) dist_pts_per_pline
FROM dist_pts dp
JOIN aggs ON dp.rid = aggs.rid
GROUP BY dp.rid, dp.run_type, aggs.total_dist_pts, aggs.total_plines
;
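On the "SUM only once" question, a hedged alternative sketch (untested against this schema): aggregate loc_a_pline once into a derived table keyed by loc_a_run_id, then join the small loc_a_run and run_type tables to the already-aggregated rows, so the big table is read exactly once and each sum is computed once.
SELECT run.loc_a_run_id AS run_id
     , rt.run_type
     , agg.total_dist_pts
     , agg.total_plines
     , agg.total_dist_pts / agg.total_plines AS dist_pts_per_pline
FROM (
    -- one pass over the big table, one row per run
    SELECT loc_a_run_id
         , SUM(CASE WHEN is_top THEN ROUND(dist_right - dist_left, 2)
                    ELSE ROUND(dist_left - dist_right, 2)
               END) AS total_dist_pts
         , COUNT(*) AS total_plines
    FROM loc_a_pline
    GROUP BY loc_a_run_id
) agg
JOIN loc_a_run run USING (loc_a_run_id)
JOIN run_type rt USING (run_type_id)
WHERE run.has_errors IS FALSE;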

Optimize postgres update query with join statement on large tables

I have 2 tables A and B.
A contains ~6M records and B ~1M records.
B contains foreign key to A, column name a_id.
I need to move the relationship between the tables, so table A will contain b_id, and remove the a_id from B.
I created this migration:
UPDATE A
SET b_id = B.id
FROM B
INNER JOIN A ON B.a_id = A.id;
This is the EXPLAIN output:
"Update on A (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Nested Loop (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=899)"
" -> Materialize (cost=64444.71..642396.62 rows=1381587 width=16)"
" -> Hash Join (cost=64444.71..628741.69 rows=1381587 width=16)"
" Hash Cond: (A.id = B.a_id)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=10)"
" -> Hash (cost=40427.87..40427.87 rows=1381587 width=14)"
" -> Seq Scan on B (cost=0.00..40427.87 rows=1381587 width=14)"
This is taking too long (more than half an hour), which exceeds the time window I have for the migration. Is there anything I can do to optimize the update statement?
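The plan points at the likely culprit: in PostgreSQL's UPDATE ... FROM syntax the target table must not be listed again in the FROM clause (unless you intend a self-join), so repeating A joins the update target against a second, independent scan of A; that is the 7.9-trillion-row nested loop in the plan. A minimal sketch of the usual form, joining B to the update target directly:
-- Sketch: the join condition goes in WHERE, and A appears only as the target.
UPDATE A
SET b_id = B.id
FROM B
WHERE B.a_id = A.id;
This form updates each matched row of A once instead of once per row of the cross join.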

Matching on postal code and city name - very slow in PostgreSQL

I am trying to update address fields in mytable with data from othertable.
If I match on postal codes and search for city names from othertable in mytable, it works reasonably fast. But as I don't have postal codes in all cases, I also want to look for names only in a 2nd query. This takes hours (>12h). Any ideas what I can do to speed up the query? Please note that indexing did not help. Index scans in (2) weren't faster.
Code for matching on postal code + name (1)
update mytable t1 set
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng
from (select * from othertable) t
where t.postal_code = t1.postal_code
  and t1.country = t.country
  and upper(t1.address) like '%' || t.admin1 || '%' -- looks whether city name from othertable shows up in address in t1
  and admin1 is null;
Code for matching on name only (2)
update mytable t1 set
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng
from (select * from othertable) t
where t1.country = t.country
  and upper(t1.address) like '%' || t.admin1 || '%' -- looks whether city name from othertable shows up in address in t1
  and admin1 is null;
Query plan 1:
"Update on mytable t1 (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" -> Merge Join (cost=19084169.53..19205622.16 rows=13781 width=1918)"
" Merge Cond: (((t1.postal_code)::text = (othertable.postal_code)::text) AND (t1.country = othertable.country))"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.postal_code, t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.postal_code, othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
Query plan 2:
"Update on mytable t1 (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" -> Merge Join (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
" Merge Cond: (t1.country = othertable.country)"
" Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
" -> Sort (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
" Sort Key: t1.country"
" -> Seq Scan on mytable t1 (cost=0.00..4057214.31 rows=6270570 width=1661)"
" Filter: (admin1 IS NULL)"
" -> Materialize (cost=752152.19..766803.71 rows=2930305 width=92)"
" -> Sort (cost=752152.19..759477.95 rows=2930305 width=92)"
" Sort Key: othertable.country"
" -> Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"
In the second query you are joining (more or less) on city name only, but othertable has several entries per city name, so each mytable record matches several othertable rows and PostgreSQL applies an unpredictable one of them (which lat/lng or admin2/admin3 ends up being used?).
If othertable has entries without a postal code, use them by adding an extra condition: AND othertable.postal_code IS NULL.
Otherwise, you will want a subset of othertable that returns one row per admin1 + country value. You would replace select * from othertable with the following query. Of course, you might want to adjust it to pick a different lat/lng/admin2/admin3 than the first one.
SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
FROM othertable
GROUP BY admin1,country
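Note that first() is not a built-in PostgreSQL aggregate; assuming it is not already defined in your database, DISTINCT ON gives the same one-row-per-group effect in stock PostgreSQL:
-- Picks one row per (admin1, country); add more ORDER BY columns to control which one.
SELECT DISTINCT ON (admin1, country)
       admin1, country, postal_code, lat, lng, admin2, admin3
FROM othertable
ORDER BY admin1, country;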
Worse, the second query would overwrite what the first query already updated, so you must skip those records by adding AND t1.postal_code IS NULL.
The entire query could be:
UPDATE mytable t1
SET
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng
FROM (
    SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
    FROM othertable
    GROUP BY admin1, country) t
WHERE t1.country = t.country
  AND upper(t1.address) like '%' || t.admin1 || '%' -- looks whether city name from othertable shows up in address in t1
  AND t1.admin1 is null
  AND t1.postal_code is null;

Filtering nulls after an outer join in postgres

I'm currently trying to figure out how to do a filter with a left join involving nulls. Here's a simplified
version of the schema I'm working on:
CREATE TABLE bookclubs (
bookclub_id UUID NOT NULL PRIMARY KEY
);
CREATE TABLE books (
bookclub_id UUID NOT NULL,
book_id UUID NOT NULL
);
ALTER TABLE books ADD CONSTRAINT books_pk PRIMARY KEY(bookclub_id, book_id);
ALTER TABLE books ADD CONSTRAINT book_to_bookclub FOREIGN KEY(bookclub_id)
REFERENCES bookclubs(bookclub_id) ON UPDATE NO ACTION ON DELETE CASCADE;
CREATE INDEX books_bookclub_index ON books (bookclub_id);
CREATE TABLE book_reviews (
bookclub_id UUID NOT NULL,
book_id UUID NOT NULL,
reviewer_id TEXT NOT NULL,
rating int8 NOT NULL
);
ALTER TABLE book_reviews ADD CONSTRAINT book_reviews_pk PRIMARY KEY(bookclub_id, book_id, reviewer_id);
ALTER TABLE book_reviews ADD CONSTRAINT book_review_to_book FOREIGN KEY(bookclub_id,book_id)
REFERENCES books(bookclub_id,book_id) ON UPDATE NO ACTION ON DELETE CASCADE;
CREATE INDEX book_review_to_book_index ON book_reviews ( bookclub_id, book_id);
CREATE INDEX book_review_by_reviewer ON book_reviews ( bookclub_id, reviewer_id, rating);
I want a query that for a given bookclub_id and reviewer_id, returns me all the books that they've rated >= 3, or that they haven't rated. Books they haven't rated have no entry in the book_reviews table, which isn't something I can do anything about. rating is actually an enum if that's relevant, but I don't think it is.
My first attempt at doing the obvious thing failed:
SELECT *
FROM books
LEFT OUTER JOIN book_reviews
ON ( ( ( books.bookclub_id = book_reviews.bookclub_id )
AND ( books.book_id = book_reviews.book_id ) )
AND ( book_reviews.reviewer_id = 'alice' ) )
WHERE books.bookclub_id = '00000000-0000-0000-0000-000000000000'
AND book_reviews.rating != 1
AND book_reviews.rating != 2;
This drops books that don't have reviews from the user, which makes some sense once I think about how the WHERE conditions are actually implemented. Here's the query plan:
Nested Loop (cost=0.30..16.39 rows=1 width=104)
-> Index Scan using book_reviews_pk on book_reviews (cost=0.15..8.21 rows=1 width=72)
Index Cond: ((bookclub_id = '00000000-0000-0000-0000-000000000000'::uuid) AND (reviewer_id = 'alice'::text))
Filter: ((rating <> 1) AND (rating <> 2))
-> Index Only Scan using books_pk on books (cost=0.15..8.17 rows=1 width=32)
Index Cond: ((bookclub_id = '00000000-0000-0000-0000-000000000000'::uuid) AND (book_id = book_reviews.book_id))
So I added an explicit check for null:
SELECT *
FROM books
LEFT OUTER JOIN book_reviews
ON ( ( ( books.bookclub_id = book_reviews.bookclub_id )
AND ( books.book_id = book_reviews.book_id ) )
AND ( book_reviews.reviewer_id = 'alice' ) )
WHERE books.bookclub_id = '00000000-0000-0000-0000-000000000000'
AND book_reviews.rating IS NULL
OR ( book_reviews.rating != 1
AND book_reviews.rating != 2);
This returns the correct results, but appears to be horribly inefficient, and grinds the db to a halt. Here's the query plan:
Hash Left Join (cost=18.75..52.56 rows=1346 width=104)
Hash Cond: ((books.bookclub_id = book_reviews.bookclub_id) AND (books.book_id = book_reviews.book_id))
Filter: (((books.bookclub_id = '00000000-0000-0000-0000-000000000000'::uuid) AND (book_reviews.rating IS NULL)) OR ((book_reviews.rating <> 1) AND (book_reviews.rating <> 2)))
-> Seq Scan on books (cost=0.00..23.60 rows=1360 width=32)
-> Hash (cost=18.69..18.69 rows=4 width=72)
-> Bitmap Heap Scan on book_reviews (cost=10.23..18.69 rows=4 width=72)
Recheck Cond: (reviewer_id = 'alice'::text)
-> Bitmap Index Scan on book_review_by_reviewer (cost=0.00..10.22 rows=4 width=0)
Index Cond: (reviewer_id = 'alice'::text)
I'm no expert on interpreting these things, but that Filter moving to the outside seems bad. Is there an efficient way to structure the query such that I can get the result I want? Thanks
Move the filter to the join condition:
SELECT *
FROM
books
LEFT OUTER JOIN
book_reviews ON
books.bookclub_id = book_reviews.bookclub_id
AND books.book_id = book_reviews.book_id
AND book_reviews.reviewer_id = 'alice'
AND book_reviews.rating != 1
AND book_reviews.rating != 2
WHERE books.bookclub_id = '00000000-0000-0000-0000-000000000000'
or a bit shorter:
AND book_reviews.rating not in (1, 2)
I believe we figured it out. We were missing a set of parens in the WHERE clause:
SELECT *
FROM books
LEFT OUTER JOIN book_reviews
ON ( ( ( books.bookclub_id = book_reviews.bookclub_id )
AND ( books.book_id = book_reviews.book_id ) )
AND ( book_reviews.reviewer_id = 'alice' ) )
WHERE books.bookclub_id = '00000000-0000-0000-0000-000000000000'
AND ( book_reviews.rating IS NULL
OR ( book_reviews.rating != 1
AND book_reviews.rating != 2) );
Without them the boolean logic associates wrong: AND binds more tightly than OR, so the bookclub_id filter applied only to the IS NULL branch and the rating branch escaped it entirely. This query returns the right result and has a sane query plan, so it looks like that was the entire issue. Thanks for looking.

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
 id | value | id | value
----+-------+----+-------
  1 | x     |    |
  2 | y     |  1 | y
  3 |       |  3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition", and why does joining on a.value IS NOT DISTINCT FROM b.value not fulfill this condition while a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, running an EXPLAIN on a NULL lookup is just as efficient as looking up values that are non-NULL:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though.
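Another way around the restriction, sketched against the schema above: keep the LEFT JOIN that works and emulate the FULL JOIN by appending the rows of b that found no partner in a.
-- LEFT JOIN covers matched pairs and unmatched a rows;
-- the UNION ALL branch adds unmatched b rows, null-extended on the a side.
SELECT a.id, a.value, b.id, b.value
FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value
UNION ALL
SELECT NULL, NULL, b.id, b.value
FROM b
WHERE NOT EXISTS (
    SELECT 1 FROM a WHERE a.value IS NOT DISTINCT FROM b.value
);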