Guaranteeing `RETURNING` from an upsert while limiting what data is stored - postgresql

I have the following table:
CREATE TABLE scoped_data(
owner_id text,
scope text,
key text,
data jsonb,
PRIMARY KEY (owner_id, scope, key)
);
As part of each transaction we will potentially be inserting data for multiple scopes. Given this table has the potential to grow very quickly I would like not to store data if it is NULL or an empty JSON object.
An upsert felt like the idiomatic approach to this. The following is within the context of a PL/pgSQL function:
WITH upserts AS (
INSERT INTO scoped_data (owner_id, scope, key, data)
VALUES
(p_owner_id, 'broad', p_broad_key, p_broad_data),
(p_owner_id, 'narrow', p_narrow_key, p_narrow_data),
-- etc.
ON CONFLICT (owner_id, scope, key)
DO UPDATE SET data = scoped_data.data || COALESCE(EXCLUDED.data, '{}')
RETURNING scope, data
)
SELECT json_object_agg(u.scope, u.data)
FROM upserts u
INTO v_all_scoped_data;
I include the RETURNING because I want the up-to-date version of each scope's data in a variable for subsequent use. I therefore need RETURNING to return something even if, logically, no data has been updated.
For example (all for key = 1 and scope = 'narrow', applied in sequence):
1. data = '{}' => v_all_scoped_data = {}, no data for key = 1 in scoped_data.
2. data = '{"some":"data"}' => v_all_scoped_data = { "narrow": { "some": "data" } }, data present in scoped_data.
3. data = '{}' => v_all_scoped_data = { "narrow": { "some": "data" } }, data from 2. remains unaffected.
4. data = '{"more":"stuff"}' => v_all_scoped_data = { "narrow": { "some": "data", "more": "stuff" } }. Updated data stored in table.
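The behaviour wanted in the example above can be modelled outside the database. The following is a minimal Python sketch, not PostgreSQL code: a dictionary stands in for scoped_data, and the names upsert_scope, run_example and the owner id "owner" are hypothetical, chosen only for illustration. It assumes a top-level merge of JSON objects, similar to what the || operator does for jsonb.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the scoped_data table:
# maps (owner_id, scope, key) -> stored JSON object.
Table = dict


def upsert_scope(table: Table, owner_id: str, scope: str, key: str,
                 data: Optional[dict]) -> Optional[dict]:
    """Merge `data` into the stored object and return the current value.

    A NULL (None) or empty object never creates or changes a row, but the
    caller still receives whatever is currently stored (None if nothing is).
    """
    pk = (owner_id, scope, key)
    current = table.get(pk)
    if not data:
        return current
    merged = {**(current or {}), **data}  # top-level merge, like jsonb || jsonb
    table[pk] = merged
    return merged


def run_example() -> list:
    """Replay the four calls from the example for scope 'narrow', key '1'."""
    table: Table = {}
    snapshots = []
    for data in ({}, {"some": "data"}, {}, {"more": "stuff"}):
        value = upsert_scope(table, "owner", "narrow", "1", data)
        # Scopes with nothing stored are left out of the aggregate.
        snapshots.append({} if value is None else {"narrow": value})
    return snapshots
```

Calling run_example() replays the four inputs in order and collects the aggregate after each one, which makes it easy to compare any SQL solution against the expected results.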
I initially added a trigger BEFORE INSERT ON scoped_data which did the following:
IF NULLIF(NEW.data, '{}') IS NULL THEN
RETURN NULL;
END IF;
RETURN NEW;
This worked fine for preventing the insertion of new records. The problem was that the trigger also blocked the INSERT attempt for rows that already exist: because no INSERT happened, there was no ON CONFLICT, and therefore RETURNING returned nothing.
A couple of approaches I've considered, both of which feel inelegant or like they should be unnecessary:
Add a CHECK constraint to scoped_data.data: CHECK(NULLIF(data, '{}') IS NOT NULL), allow the insert and catch the exception in the PL/pgSQL code.
DELETE in an AFTER INSERT trigger if the data field was NULL or empty.
Am I going about this in the right way? Am I trying to coerce this logic into an upsert when there is a better way? Might explicit INSERTs and UPDATEs be a more logical fit?
I am using Postgres 9.6.

I would go with the BEFORE trigger ON INSERT to prevent unnecessary inserts and updates.
To return the values even when the operation is not performed, you can UNION ALL your query with a query on scoped_data that returns the original row. Add an artificial column to both queries, ORDER the results so that any new row sorts first, and use LIMIT 1 to get the correct result.

PostgreSQL array of data composite update element using where condition

I have a composite type:
CREATE TYPE mydata_t AS
(
user_id integer,
value character(4)
);
I also have a table that uses this composite type as an array of mydata_t:
CREATE TABLE tbl
(
id serial NOT NULL,
data_list mydata_t[],
PRIMARY KEY (id)
);
Here I want to update the mydata_t element in data_list whose user_id is 100000, but I don't know which array element that is.
So I first have to search for the element whose user_id is 100000, and that is my problem: I don't know how to write the query. In short, I want to update the value of the array element whose user_id is 100000 (and where the id of tbl is, for example, 1). What would my query be?
Something like this (I know it's wrong!):
UPDATE "tbl" SET "data_list"[i]."value"='YYYY'
WHERE "id"=1 AND EXISTS (SELECT ROW_NUMBER() OVER() AS i
FROM unnest("data_list") "d" WHERE "d"."user_id"=100000 LIMIT 1)
For example, this is my tbl data:
Row1 => id = 1, data = ARRAY[ROW(5,'YYYY'),ROW(6,'YYYY')]
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'YYYY')]
Now I want to update tbl where id is 2 and set the value of the data_list element whose user_id is 11 to 'XXXX'.
In fact, the final result of Row2 will be this:
Row2 => id = 2, data = ARRAY[ROW(10,'YYYY'),ROW(11,'XXXX')]
If you know the current contents of the value field, you can use the array_replace() function to make the change:
UPDATE tbl
SET data_list = array_replace(data_list, (11, 'YYYY')::mydata_t, (11, 'XXXX')::mydata_t)
WHERE id = 2
If you do not know the current contents of the value field, the situation becomes more complex:
UPDATE tbl SET data_list = data_arr
FROM (
-- UPDATE doesn't allow aggregate functions so aggregate here
SELECT array_agg(new_data) AS data_arr
FROM (
-- For the id value, get the data_list values that are NOT modified
SELECT (user_id, value)::mydata_t AS new_data
FROM tbl, unnest(data_list)
WHERE id = 2 AND user_id != 11
UNION
-- Add the values to update
VALUES ((11, 'XXXX')::mydata_t)
) x
) y
WHERE id = 2
You should keep in mind, though, that a lot of work goes on in the background that cannot be optimised. The array of mydata_t values has to be examined from start to finish, and you cannot use an index for this. Furthermore, an update actually writes a new row version to the underlying file on disk, and if your array has more than a few entries this involves substantial work. It gets even more problematic when your arrays are larger than the page size of your PostgreSQL server, typically 8kB. This all happens behind the scenes, so it will work, but at a performance penalty.
Even though array_replace sounds as if changes are made in place (and in memory they indeed are), the UPDATE command writes a completely new tuple to disk. So if you have 4,000 array elements, at least 40kB of data has to be read (8 bytes for the mydata_t type on a typical system x 4,000 = 32kB in a TOAST file, plus the main page of the table, 8kB) and then written back to disk after the update. A real performance killer.
As @klin pointed out, this design may be more trouble than it is worth. Should you make data_list a table of its own (as I would do), the update query becomes:
UPDATE data_list SET value = 'XXXX'
WHERE id = 2 AND user_id = 11
This will have MUCH better performance, especially if you add the appropriate indexes. You could then still create a view to publish the data in an aggregated form with a custom type if your business logic so requires.

With PostgREST, convert a column to and from an external encoding in the API

We are using PostgREST to automatically generate a REST API for a Postgres database. Our primary keys have an external representation that's different from how we store them internally. For simplicity's sake, let's pretend the ids are stored as integers but we represent them as hexadecimal strings outwardly.
It's simple enough to get PostgREST to convert to the external representation for read operations:
CREATE DOMAIN hexid AS bigint;
CREATE TABLE fruits (
fruit_id hexid PRIMARY KEY,
name text
);
CREATE OR REPLACE VIEW api_fruits AS
SELECT to_hex(fruit_id) as fruit_id, name FROM fruits;
INSERT INTO fruits(fruit_id, name) VALUES('51955', 'avocado');
PostgREST generates the expected representation when we GET api_fruits:
[
{
"fruit_id": "caf3",
"name": "avocado"
}
]
But that's about as far as we get with this solution. It's a one way transformation so we won't be able to POST/PATCH records this way. The way PostgREST works is to transform such requests into equivalent INSERT and UPDATE statements. But this view with its custom formatting is not updatable. This is what would happen if we tried:
ERROR: cannot insert into column "fruit_id" of view "api_fruits"
DETAIL: View columns that are not columns of their base relation are not updatable.
STATEMENT: WITH pgrst_source AS (WITH pgrst_payload AS (SELECT $1::json AS json_data), pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload) INSERT INTO "api_x"."api_fruits"("fruit_id", "name") SELECT "fruit_id", "name" FROM json_populate_recordset (null::"api_x"."api_fruits", (SELECT val FROM pgrst_body)) _ RETURNING "api_x"."api_fruits".*) SELECT '' AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, CASE WHEN pg_catalog.count(_postgrest_t) = 1 THEN coalesce((
WITH data AS (SELECT row_to_json(_) AS row FROM pgrst_source AS _ LIMIT 1)
SELECT array_agg(json_data.key || '=eq.' || json_data.value)
FROM data CROSS JOIN json_each_text(data.row) AS json_data
WHERE json_data.key IN ('')
), array[]::text[]) ELSE array[]::text[] END AS header, '' AS body, nullif(current_setting('response.headers', true), '') AS response_headers, nullif(current_setting('response.status', true), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t
We can't INSERT into "View columns that are not columns of their base relation".
The obvious workaround is to serve fruit_id as a straight column, just an integer. With some post and preprocessing at the nginx level we can hex encode it there (and hex decode incoming ids). I'm wondering if we can do better than that though. For large API operations, re-encoding the JSON will use a lot of memory and CPU time and it seems so unnecessary.
It would have been great to be able to use a custom CREATE CAST to take the incoming hexadecimal strings and turn them back into integers, something like this:
CREATE CAST (json AS hexid) WITH FUNCTION json_to_hexid AS ASSIGNMENT;
But alas custom casts are ignored on CREATE DOMAIN types. And we can't make a true custom column type because our cloud Postgres host (Google Cloud SQL) doesn't allow custom extensions.
It feels like some combination of INSTEAD OF triggers or rules could work. But when filtering results with query parameters (e.g. selecting a fruit by id), I don't think there's an appropriate trigger to use. INSTEAD OF doesn't work for a straight SELECT, does it?
For example I've tested doing something like this to take care of INSERT and allow POST with PostgREST. It works:
CREATE OR REPLACE FUNCTION api_fruits_insert()
RETURNS trigger AS
$$
BEGIN
INSERT INTO fruits(fruit_id, name) VALUES (('x' || lpad(NEW.fruit_id, 16, '0'))::bit(64)::bigint::hexid, NEW.name);
RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER api_fruits_insert
INSTEAD OF INSERT
ON api_fruits
FOR EACH ROW
EXECUTE PROCEDURE api_fruits_insert();
The trouble is in the WHERE clause. Let's PATCH api_fruits?fruit_id=in.(7b,caf3) with {"name": "pear"}. This works out of the box since the name column is updatable but look at the query:
WITH pgrst_source AS (WITH pgrst_payload AS (SELECT $1::json AS json_data), pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload) UPDATE "api_x"."api_fruits" SET "name" = _."name" FROM (SELECT * FROM json_populate_recordset (null::"api_x"."api_fruits" , (SELECT val FROM pgrst_body) )) _ WHERE "api_x"."api_fruits"."fruit_id" = ANY ($2) RETURNING 1) SELECT '' AS total_result_set, pg_catalog.count(_postgrest_t) AS page_total, array[]::text[] AS header, '' AS body, nullif(current_setting('response.headers', true), '') AS response_headers, nullif(current_setting('response.status', true), '') AS response_status FROM (SELECT * FROM pgrst_source) _postgrest_t
DETAIL: parameters: $1 = '{
"name": "pear"
}', $2 = '{7b,caf3}'
So we essentially have UPDATE api_fruits SET name='pear' WHERE fruit_id IN ('7b', 'caf3');. Surprisingly this works, but it requires a full table scan because Postgres has to evaluate to_hex(fruit_id) for every row to look for matches. The same happens if we try to GET a record by fruit_id. How would we rewrite the WHERE clauses?
It really feels like some combination of just the right Postgres and PostgREST features should be able to get us to a point where it's all happening in Postgres without nginx's help and without excessive complexity. Any ideas?
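To make the WHERE-clause problem concrete: the filter is cheap only if the incoming hexadecimal strings are decoded once to integers and compared against the stored column, rather than encoding every stored id for comparison. The following Python sketch shows only that conversion, wherever it ends up running. The function names are hypothetical, and it assumes ids are non-negative (to_hex prints negative numbers in two's complement, which this sketch does not handle).

```python
def encode_id(fruit_id: int) -> str:
    """Internal integer -> external lowercase hex string (like to_hex)."""
    if fruit_id < 0:
        raise ValueError("only non-negative ids are handled in this sketch")
    return format(fruit_id, "x")


def decode_id(external: str) -> int:
    """External hex string -> internal integer."""
    return int(external, 16)


def decode_filter(external_ids: list) -> list:
    """Decode a whole in.(...) filter list once, before querying."""
    return [decode_id(value) for value in external_ids]
```

For the PATCH example above, decode_filter(["7b", "caf3"]) yields [123, 51955], which can be compared directly against the integer primary key.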

SELECT from result of UPDATE ... RETURNING in jOOQ

I'm converting some old PostgreSQL code to jOOQ, and I'm currently struggling with SQL that has multiple WITH clauses, where each one depends on the previous one. It would be best to keep the SQL logic the way it was written and not change it (e.g. by splitting it into multiple queries to the DB).
As it seems, there is no way to do SELECT on something that is UPDATE ... RETURNING, for example
dsl.select(DSL.asterisk())
.from(dsl.update(...)
.returning(DSL.asterisk())
)
I've created some test tables, trying to create some sort of MCVE:
create table dashboard.test (id int primary key not null, data text); --test table
with updated_test AS (
UPDATE dashboard.test SET data = 'new data'
WHERE id = 1
returning data
),
test_user AS (
select du.* from dashboard.dashboard_user du, updated_test -- from previous WITH
where du.is_active AND du.data = updated_test.data
)
SELECT jsonb_build_object('test_user', to_jsonb(tu.*), 'updated_test', to_jsonb(ut.*))
FROM test_user tu, updated_test ut; -- from both WITH clauses
So far this is my jOOQ code (written in Kotlin):
dsl.with("updated_test").`as`(
dsl.update(Tables.TEST)
.set(Tables.TEST.DATA, DSL.value("new data"))
.returning(Tables.TEST.DATA) //ERROR is here: Required Select<*>, found UpdateResultStep<TestRecord>
).with("test_user").`as`(
dsl
.select(DSL.asterisk())
.from(
Tables.DASHBOARD_USER,
DSL.table(DSL.name("updated_test")) //or what to use here?
)
.where(Tables.DASHBOARD_USER.IS_ACTIVE.isTrue
.and(Tables.DASHBOARD_USER.DATA.eq(DSL.field(DSL.name("updated_test.data"), String::class.java)))
)
)
.select() //here goes my own logic for jsonBBuildObject (which is tested and works for other queries)
.from(
DSL.table(DSL.name("updated_test")), //what to use here
DSL.table(DSL.name("test_user")) //or here
)
Are there any workarounds for this? I'd like to avoid changing SQL if possible.
Also, in this project this trick is used very often to get JSON(B) from UPDATE clause (table has JSON(B) columns too):
with _updated AS (update dashboard.test SET data = 'something' WHERE id = 1 returning *)
select to_jsonb(_updated.*) from _updated;
and it will be a real step back for us if there is no workaround for this.
I'm using jOOQ version 3.13.3, and Postgres 12.0.
This is currently not supported in jOOQ, see:
https://github.com/jOOQ/jOOQ/issues/3185
https://github.com/jOOQ/jOOQ/issues/4474
The workaround is, as always when some vendor-specific syntax is unsupported, to resort to plain SQL templating, e.g.:
// If you don't need to map data types
dsl.fetch("with t as ({0}) {1}", update, select);
// If you need to map data types
dsl.resultQuery("with t as ({0}) {1}", update, select).coerce(select.getSelect()).fetch();

Using session.query(cls).from_statement to do insert with "on conflict ... returning *" multiple times does not reflect changes until commit

I'm using sqlalchemy 1.3.0 with postgres 11. I'm trying to use an INSERT with ON CONFLICT DO UPDATE ... RETURNING * in order to create an instance of my model.
from sqlalchemy import Column, String, text

# Base (the declarative base) and session are assumed to be set up elsewhere.
class Model(Base):
    __tablename__ = 'mytable'
    pk = Column(String(64), primary_key=True)
    col2 = Column(String(64))

    @classmethod
    def upsert(cls, pk, col2, session):
        return session.query(cls).from_statement(
            text(
                """
                INSERT INTO {} (pk, col2) VALUES (:pk, :col2)
                ON CONFLICT (pk) DO UPDATE SET col2=EXCLUDED.col2 RETURNING *;
                """.format(cls.__tablename__)
            )
        ).params(pk=pk, col2=col2).one()

obj1 = Model.upsert(1, 'one', session)
obj2 = Model.upsert(1, 'two', session)
print(obj2.col2)  # outputs "one"
session.commit()
print(obj2.col2)  # outputs "two"
The second upsert does issue the correct command to the database, but printing the col2 attribute of the object returned shows the value of the column that was inserted during the first upsert. Then, if I do a session.commit(), the object magically gets updated to show the new value. What am I missing? I want the object returned from the function to reflect the values that the row was updated to, without having to do a commit, because I want this along with several other things to happen within a transaction.
It turns out that the session.commit() was essentially expiring the object so that the following print(obj2.col2) would requery the database to populate its attributes.
In order to have obj2.col2 be the correct value without committing (and requerying the database), I just needed to do session.expunge(obj1) before the second upsert call to create obj2.
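The effect can be pictured with a toy identity map. This is a simplified model written for illustration only, not SQLAlchemy's implementation or API: the classes Row and ToySession and the method load are hypothetical. It assumes, as described above, that a session hands back the object it already holds for a primary key without overwriting its loaded attributes.

```python
class Row:
    """Stand-in for a mapped object."""

    def __init__(self, pk, col2):
        self.pk = pk
        self.col2 = col2


class ToySession:
    """Toy identity map: at most one object per primary key."""

    def __init__(self):
        self._identity_map = {}

    def load(self, pk, col2):
        """Simulate turning a row returned by the database into an object."""
        obj = self._identity_map.get(pk)
        if obj is None:
            obj = Row(pk, col2)
            self._identity_map[pk] = obj
        # An object that is already present is returned as-is; its
        # attributes are NOT refreshed from the new row.
        return obj

    def expunge(self, obj):
        """Forget the object so the next load builds a fresh one."""
        self._identity_map.pop(obj.pk, None)


session = ToySession()
obj1 = session.load(1, "one")
obj2 = session.load(1, "two")   # same object as obj1, still holds "one"
session.expunge(obj1)
obj3 = session.load(1, "two")   # fresh object, holds "two"
```

In this model obj2 is obj1, which is why the stale value shows up, while obj3 is a new object built from the second row.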

Track updated columns postgresql trigger?

I am using PostgreSQL and I have tables on which I use triggers to get notified of changes.
Now, I have a use case where, when an update on a table is done, I want to notify only the updated columns of my table. So if my table has 10 columns and only 5 get updated, I need to notify only those 5 updated columns.
One approach would be to compare OLD and NEW for every column. This would lead to a separate function for each table.
Is there any functionality in PostgreSQL pertaining to such a case?
You can create triggers in pl/tcl or pl/perl.
Both produce all the information about the context of the table that triggers the function in a bunch of variables, and OLD and NEW are associative arrays, so you have a lot of freedom to do whatever you like. Eg:
CREATE OR REPLACE FUNCTION valid_id() RETURNS trigger AS $$
my ($new, $old) = ($_TD->{new}, $_TD->{old});
my @allflds = keys %$new;
# Columns whose value differs between OLD and NEW
my @changed = grep { $new->{$_} ne $old->{$_} } @allflds;
my (%difold, %difnew);
%difold = map { $_ => $old->{$_} } @changed;
%difnew = map { $_ => $new->{$_} } @changed;
# ... notify the values in %difold and %difnew as you need ...
return;
$$ LANGUAGE plperl;
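The comparison itself is independent of the trigger language. As a rough sketch of the same idea in plain Python, assuming the old and new rows are available as dictionaries keyed by column name (the function name changed_columns and the sample rows are hypothetical):

```python
def changed_columns(old: dict, new: dict) -> dict:
    """Return {column: (old_value, new_value)} for columns that differ."""
    return {
        column: (old.get(column), new_value)
        for column, new_value in new.items()
        if old.get(column) != new_value
    }


# Example rows: only "price" and "stock" were updated.
old_row = {"id": 7, "name": "avocado", "price": 120, "stock": 4}
new_row = {"id": 7, "name": "avocado", "price": 135, "stock": 3}
diff = changed_columns(old_row, new_row)
```

Because the function works on whatever columns the rows contain, one generic function can serve every table instead of one function per table.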