How to use multiple "unique index inferences" in a PostgreSQL query

In a PostgreSQL 9.6 database, there is an existing table named X that has four columns, a, b, c, and d, with indexes set up like this:
"uidx_a_b" UNIQUE, btree (a, b) WHERE c IS NULL AND d IS NULL
"uidx_a_c" UNIQUE, btree (a, b, c) WHERE c IS NOT NULL AND d IS NULL
"uidx_a_d" UNIQUE, btree (a, b, c, d) WHERE c IS NOT NULL AND d IS NOT NULL
I don't know why it was designed this way; it was done by someone long gone, before I had to modify it.
I am trying to get the syntax right for specifying all three of these indexes in an ON CONFLICT clause. I tried every variation I could think of, all of which failed with errors. The PostgreSQL documentation indicates this is possible, specifically the [, ...] described in the conflict_target here:
( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] )
Also, this blog post from one of the committers says so. Finally, I looked at this unit test for the functionality, again to no avail! Having thus given up, I am turning to SO for help.
This is what I believe is the closest syntax I tried that should work:
ON CONFLICT (
((a, b) WHERE c IS NULL AND d IS NULL),
((a, b, c) WHERE c IS NOT NULL AND d IS NULL),
((a, b, c, d) WHERE c IS NOT NULL AND d IS NOT NULL)
)
However this fails with:
ERROR: syntax error at or near ","
While I am open to suggestions to improve the design of those indexes, I really want to know whether there is a valid syntax for specifying this ON CONFLICT clause, as it seems there should be!

In the following I am referring to the syntax description from the documentation:
ON CONFLICT [ conflict_target ] conflict_action
where conflict_target can be one of:
( { index_column_name | ( index_expression ) }
[ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
ON CONSTRAINT constraint_name
INSERT ... ON CONFLICT allows only a single conflict_target.
The [, ...] means that more than one column or expression can be specified, together making up a single conflict target, like this:
ON CONFLICT (col1, (col2::text), col3)
Moreover, if it is a partial index, the WHERE condition must be implied by index_predicate.
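For example, a single one of your partial indexes can be inferred by adding the matching WHERE index_predicate (a minimal sketch against uidx_a_b, with made-up values):
INSERT INTO x (a, b) VALUES (1, 2)
ON CONFLICT (a, b) WHERE c IS NULL AND d IS NULL
DO NOTHING;
But there is no way to list all three indexes in one conflict_target.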
So what can you do?
You can follow the advice from @joop and find a value that cannot occur in columns c and d.
Then you can replace your three indexes with:
CREATE UNIQUE INDEX ON x (a, b, coalesce(c, -1), coalesce(d, -1));
The conflict_target would then become:
ON CONFLICT (a, b, coalesce(c, -1), coalesce(d, -1))
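A complete statement could then look like this (a sketch, assuming -1 can never occur in c or d; since every column of this toy table is part of the key, DO NOTHING is the natural action):
INSERT INTO x (a, b, c, d)
VALUES (1, 2, NULL, NULL)
ON CONFLICT (a, b, coalesce(c, -1), coalesce(d, -1))
DO NOTHING;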

Related

PostgreSQL Upsert With Multiple Constraints

I'm trying to do an upsert on a table with two constraints. One is that the column a is unique, the other is that the columns b, c, d and e are unique together. What I don't want is that a, b, c, d and e are unique together, because that would allow two rows having the same value in column a.
The following fails if the second constraint (unique b, c, d, e) is violated:
INSERT INTO my_table (a, b, c, d, e, f, g)
SELECT a, b, c, d, e, f, g
FROM my_temp_table temp
ON CONFLICT (a) DO UPDATE SET
a=EXCLUDED.a,
b=EXCLUDED.b,
c=EXCLUDED.c,
d=EXCLUDED.d,
e=EXCLUDED.e,
f=EXCLUDED.f,
g=EXCLUDED.g;
The following fails if the first constraint (unique a) is violated:
INSERT INTO my_table (a, b, c, d, e, f, g)
SELECT a, b, c, d, e, f, g
FROM my_temp_table temp
ON CONFLICT ON CONSTRAINT my_table_unique_together_b_c_d_e DO UPDATE SET
a=EXCLUDED.a,
b=EXCLUDED.b,
c=EXCLUDED.c,
d=EXCLUDED.d,
e=EXCLUDED.e,
f=EXCLUDED.f,
g=EXCLUDED.g;
How can I bring those two together? I first tried to define a constraint that says "either a is unique or b, c, d and e are unique together" but it looks like that isn't possible. I then tried two INSERT statements with WHERE clauses making sure that the other constraint doesn't get violated, but there is a third case where a row might violate both constraints at the same time. To handle the last case I considered dropping one of the constraints and creating it after the INSERT, but isn't there a better way to do this?
I also tried the following, but according to the PostgreSQL documentation, ON CONFLICT without a conflict target can only DO NOTHING:
INSERT INTO my_table (a, b, c, d, e, f, g)
SELECT a, b, c, d, e, f, g
FROM my_temp_table temp
ON CONFLICT DO UPDATE SET
a=EXCLUDED.a,
b=EXCLUDED.b,
c=EXCLUDED.c,
d=EXCLUDED.d,
e=EXCLUDED.e,
f=EXCLUDED.f,
g=EXCLUDED.g;
I read in another question that it might work using MERGE in PostgreSQL 15, but sadly that isn't available on AWS RDS yet. I need to find a way to do this using PostgreSQL 14.
I think what you need is a somewhat different design. I suppose "a" is a surrogate key, b, c, d, e, f, g make up the natural key, and there are other columns that hold the data.
So force column "a" to be generated automatically, like this:
CREATE TEMP TABLE my_table(
a bigint GENERATED ALWAYS AS IDENTITY,
b bigint NOT NULL,
c bigint NOT NULL,
d bigint NOT NULL,
e bigint NOT NULL,
f bigint NOT NULL,
g bigint NOT NULL,
data text,
CONSTRAINT my_table_unique_together_b_c_d_e UNIQUE (b,c,d,e)
);
And then just leave the a column out of your insert (and include data, so the DO UPDATE has something to set):
INSERT INTO my_table (b, c, d, e, f, g, data)
SELECT b, c, d, e, f, g, data
FROM my_temp_table temp
ON CONFLICT ON CONSTRAINT my_table_unique_together_b_c_d_e DO UPDATE SET
data=EXCLUDED.data;
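Rerunning an insert with the same natural key then updates the existing row instead of failing or creating a duplicate, for example (a sketch with made-up values):
INSERT INTO my_table (b, c, d, e, f, g, data)
VALUES (1, 2, 3, 4, 5, 6, 'second version')
ON CONFLICT ON CONSTRAINT my_table_unique_together_b_c_d_e
DO UPDATE SET data = EXCLUDED.data;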

How to unpivot a large AWS Redshift table

I am trying to run a query against a table in AWS Redshift (which is derived from PostgreSQL). Below is a simplified definition of the table:
CREATE TABLE some_schema.some_table (
row_id int
,productid_level1 char(1)
,productid_level2 char(1)
,productid_level3 char(1)
)
;
INSERT INTO some_schema.some_table
VALUES
(1, 'a', 'b', 'c')
,(2, 'd', 'c', 'e')
,(3, 'c', 'f', 'g')
,(4, 'e', 'h', 'i')
,(5, 'f', 'j', 'k')
,(6, 'g', 'l', 'm')
;
I need to return a de-duped, single-column table of a given productid and all of its children. "Children" means any productid that has a "level" higher than the given product's (for a given row), and also its grandchildren.
For example, for productid 'c', I expect to return...
'c' (because it's found in rows 1, 2, and 3)
'e' (because it's a child of 'c' in row 2)
'f' and 'g' (because they're children of 'c' in row 3)
'h' and 'i' (because they're children of 'e' in row 4)
'j' and 'k' (because they're children of 'f' in row 5)
and 'l' and 'm' (because they're children of 'g' in row 6)
Visually, I expect to return the following:
productid
---------
c
e
f
g
h
i
j
k
l
m
The actual table has about 3M rows and has about 20 "levels".
I think there are 2 parts to this query -- (1) a recursive CTE to build out the hierarchy and (2) an unpivot operation.
I have not attempted (1) yet. For (2), I have tried a query like the following, but it hasn't returned even after 3 minutes. As this will be used for an operational report, I need it to return in < 15 seconds.
select
b.productid
,b.product_level
from
some_schema.some_table as a
cross join lateral (
values
(a.productid_level1, 1)
,(a.productid_level2, 2)
...
,(a.productid_level20, 20)
) as b(productid, product_level)
How can I write the query to achieve (1) and (2) and be very performant?
I would avoid using the term "hierarchy", as that "usually" implies that any node has at most one parent.
I admit I'm lost as to the nature of the graph/network this table represents. But you might benefit from a little brute force and code repetition.
Whatever eventually works for you, I think you'll need to persist/materialise/cache the results, as repeating this at report time is unlikely to ever be a good idea.
I'm a data engineer by trade, and I'm sure whoever built this had good reasons for it (or, like me, they maybe screwed up). Either way, there are many good reasons to ask them to materialise the graph in more than just one form, each suited to different use cases. So asking them for a traditional adjacency list, as well as the table you already have, is a reasonable request. Or, at the very least, a good starting point for a conversation.
So, a brute force approach?
WITH
adjacency (parent, child) AS
(
SELECT productid_level1, productid_level2 FROM some_table WHERE productid_level2 IS NOT NULL
UNION
SELECT productid_level2, productid_level3 FROM some_table WHERE productid_level3 IS NOT NULL
UNION
...
UNION
SELECT productid_level19, productid_level20 FROM some_table WHERE productid_level20 IS NOT NULL
)
The WHERE clause eliminates any sparse data before it enters the map.
The UNION (without ALL) ensures duplicate links are eliminated. You should also test UNION ALL wrapped in a SELECT DISTINCT (or similar) to see which performs better.
Then you can use that adjacency list in the usual recursive walk, to find all children of a given node. (Taking care that there aren't any cyclic paths.)
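A sketch of that walk, assuming Redshift's WITH RECURSIVE support and the adjacency CTE from above (its remaining UNION branches elided here); the start node 'c' is hard-coded:
WITH RECURSIVE
adjacency (parent, child) AS
(
SELECT productid_level1, productid_level2 FROM some_table WHERE productid_level2 IS NOT NULL
UNION
SELECT productid_level2, productid_level3 FROM some_table WHERE productid_level3 IS NOT NULL
-- ... and so on, up to productid_level20, as shown above ...
),
walk (productid) AS
(
SELECT 'c'
UNION ALL
SELECT adj.child
FROM adjacency adj
JOIN walk w ON adj.parent = w.productid
)
SELECT DISTINCT productid
FROM walk;
The DISTINCT covers nodes reachable along more than one path; it does not protect against cycles, so verify the data is acyclic first.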

Why doesn't LATERAL work with values?

It doesn't make sense; is a literal not a valid column?
SELECT x, y FROM (select 1 as x) t, LATERAL CAST(2 AS FLOAT) AS y; -- fine
SELECT x, y FROM (select 1 as x) t, LATERAL 2.0 AS y; -- SYNTAX ERROR!
The same happens if you use a CASE expression, or x+1, or (x+1)... it seems to be an error for any non-function.
The PostgreSQL guide, about LATERAL expressions (not LATERAL subqueries), says:
LATERAL is primarily useful when the cross-referenced column is necessary for computing the row(s) to be joined (...)
NOTES
The question is about LATERAL single_column_expression, not LATERAL multicolumn_subquery. Example:
SELECT x, y, exp, z
FROM (select 3) t(x), -- subquery
LATERAL round(x*0.2+1.2) as exp, -- expression!
LATERAL (SELECT exp+2.0 AS y, x||'foo' as z) t2 --subquery
;
... After @klin's comment showing that the guide, in another place, says "only functions", the question "Why?" must be expressed in a more specific way, changing the scope of the question a little:
The "only functions" restriction makes no sense: the syntax (x) or (x+1), encapsulating an expression in parentheses, is fine, is it not? Why only functions?
PS: perhaps there is a future plan, or perhaps a real problem in parsing generic expressions... As users, we should show the PostgreSQL developers what makes sense and what we need.
It'll all work fine if you wrap it in its own subquery
SELECT x, y FROM (select 1 as x) t, LATERAL (SELECT 2.0 AS y) z;
A literal is a valid value for a column, but as the docs you quoted say, LATERAL syntax is used
for computing the row(s) to be joined
A relation, such as a FROM or JOIN or LATERAL subquery clause, always computes tuples of (a single or multiple) columns. The alias you're assigning is not for an individual value, but for the whole tuple.
Answering "Why only functions?" by intuition.
Or "Why does the PostgreSQL spec use only functions?". Of course, it's not a question about the parser, because it complies with the specification.
The SELECT syntax guide shows the only places where we can use LATERAL:
[ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ] ...
[ LATERAL ] function_name ( [ argument [, ...] ] ) ...
[ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) ...
So there would be no conflict with
[ LATERAL ] (single_expression) [ AS ] alias
@Bergi's guess is that a literal expression like LATERAL 2.0 AS y could be interpreted as LATERAL "2"."0", i.e. table "0" in schema "2"... But, as we saw above, it makes no sense to expect a table name after the LATERAL keyword, so in fact there is no ambiguity.
Conclusion: it looks like the specification of LATERAL can grow and allow the use of expressions. This is the great advantage of being able to discuss and participate in open community software!
Why LATERAL single_expression AS alias? Rationale:
to be orthogonal: any new user of PostgreSQL who sees that SELECT a, x, x+b AS y FROM t, LATERAL f(a) AS x is valid will naturally also try expressions instead of functions. That is what one expects of an "orthogonal system", and it is intuitive for any programmer.
to reuse expressions: we use "chains of dependent expressions" in every language, things like a=b+c; x=a+y; z=a/2; .... It is ugly to nest SELECT (SELECT (SELECT ...)) in SQL just to reuse expressions. A "chain of LATERALs" is more elegant and human-readable, and perhaps also better for query optimization.
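That chained style is, by the way, already expressible today with LATERAL subqueries. A sketch, assuming a hypothetical table t with numeric columns b, c, and y:
-- a = b + c; x = a + y; z = a / 2
SELECT b, c, y, a, x, z
FROM t,
LATERAL (SELECT b + c AS a) s1,
LATERAL (SELECT a + y AS x, a / 2 AS z) s2;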

PostgreSQL UPDATE of a numeric column with NULL values fails if all values for that column are NULL

I have a database table like this:
idx[PK] | a[numeric] | b[numeric]
--------+------------+-----------
1       | 1          | 1
2       | 2          | 2
3       | 3          | 3
4       | 4          | 4
...     | ...        | ...
In pgAdmin 4, I tried to update this table with some NULL values, and I noticed that the following queries failed:
UPDATE test as t SET
a = e.a,b = e.b
FROM (VALUES (1,NULL,NULL),(2,NULL,NULL),(3,NULL,NULL))
AS e(idx, a, b)
WHERE t.idx = e.idx
UPDATE test as t SET
a = e.a,b = e.b
FROM (VALUES (1,NULL,1),(2,NULL,2),(3,NULL,NULL))
AS e(idx, a, b)
WHERE t.idx = e.idx
The error message is like this:
ERROR: column "a" is of type numeric but expression is of type text
LINE 2: a = e.a,b = e.b
^
HINT: You will need to rewrite or cast the expression.
SQL state: 42804
Character: 43
However, this will be successful:
UPDATE test as t SET
a = e.a,b = e.b
FROM (VALUES (1,NULL,1),(2,2,NULL),(3,NULL,NULL))
AS e(idx, a, b)
WHERE t.idx = e.idx
It seems that if the new values for one of the columns I am updating are all NULL, the query fails; however, as long as at least one value in that column is a non-NULL number, the query succeeds. Why is this?
I did simplify my real-world case here, as my actual table has millions of rows and more than 10 columns. Using Python and psycopg2, when I tried to update 50,000 rows in one query, the previous error could still show up even though some value in a column was NOT NULL. I guess that is because the system scans a certain number of rows to decide the type, instead of all 50,000 rows.
So, how do I avoid this failure in my real-world situation? Is there a better query to use instead of UPDATE?
Thank you very much!
UPDATE
Per comments from @Marth and @Gordon Linoff, and as I am using psycopg2, I did the following in my code:
from psycopg2.extras import execute_values
sql = """UPDATE test as t SET
a = (e.a::numeric),
b = (e.b::numeric)
FROM (VALUES %s)
AS e(idx, a, b)
WHERE t.idx = e.idx"""
execute_values(cursor, sql, data)
cursor is from the database connection. data is a list of tuples in the form (idx, a, b) of my values.
This is due to how NULL is typed in these situations. A NULL literal starts out with the unknown type, which is then resolved to whatever type is necessary.
In a VALUES list, Postgres tries to infer the types. It treats the individual records as it would with a UNION. But if all the values in a column are NULL... well, then there is no type information, and Postgres falls back on text as the universal default.
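You can watch this type resolution happen directly (a quick check):
SELECT pg_typeof(v) FROM (VALUES (NULL), (NULL)) AS t(v); -- text
SELECT pg_typeof(v) FROM (VALUES (NULL), (1.5)) AS t(v); -- numeric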
It is also important to understand that this fails with the same error:
UPDATE test t
SET a = ''
WHERE t.idx = 1;
The issue is that Postgres does not convert empty strings to numbers (unlike some other databases).
In any case, this is easily fixed by casting the NULL to an appropriate type:
UPDATE test t
SET a = e.a,b = e.b
FROM (VALUES (1, NULL::numeric, NULL::numeric),
(2, NULL, NULL),
(3, NULL, NULL)
) e(idx, a, b)
WHERE t.idx = e.idx ;
You can be explicit for all occurrences of NULL, but that is not necessary.
Here is a db<>fiddle that illustrates some of this.

How to create composite UNIQUE constraint with nullable columns?

Let's say I have a table with several columns [a, b, c, d] which can all be nullable. This table is managed with Typeorm.
I want to create a unique constraint on [a, b, c]. However, this constraint does not work if one of these columns is NULL. I can, for instance, insert [a=0, b=1, c=NULL, d=0] and [a=0, b=1, c=NULL, d=1], where d has different values.
With raw SQL I could set multiple partial constraints (Create unique constraint with null columns); however, in my case the unique constraint is on 10 columns. It seems absurd to set a constraint for every possible combination...
I could also create a sort of hash function, but that method does not seem proper to me.
Does Typeorm provide a solution for such cases?
If you have values that can never appear in those columns, you can use them as a replacement in the index:
create unique index on the_table (coalesce(a,-1), coalesce(b, -1), coalesce(c, -1));
That way NULL values are treated the same inside the index, without the need to use them in the table.
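A quick demonstration with the two rows from the question (a sketch, assuming integer columns):
CREATE TABLE the_table (a int, b int, c int, d int);
CREATE UNIQUE INDEX ON the_table (coalesce(a, -1), coalesce(b, -1), coalesce(c, -1));
INSERT INTO the_table VALUES (0, 1, NULL, 0); -- succeeds
INSERT INTO the_table VALUES (0, 1, NULL, 1); -- fails with a duplicate key error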
If those columns are numeric or float (rather than integer or bigint) using '-Infinity' might be a better substitution value.
There is a drawback to this, though: the index will not be usable for queries on those columns unless you also use the coalesce() expression. So with the above index, a query like:
select *
from the_table
where a = 10
and b = 100;
would not use the index. You would need to use the same expressions as used in the index itself:
select *
from the_table
where coalesce(a, -1) = 10
and coalesce(b, -1) = 100;
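If in doubt, EXPLAIN will show whether the planner picks the index up (a quick check; on a tiny table it may still prefer a sequential scan):
EXPLAIN
SELECT *
FROM the_table
WHERE coalesce(a, -1) = 10
AND coalesce(b, -1) = 100;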