Postgres: How to transform data in two cursors before comparison?

I am replacing a legacy function get_data in our database which takes some entity_id and returns a refcursor.
I am writing a new function get_data_new which is using different data sources, but the outputs are expected to be the same as get_data for the same input.
I'd like to verify this with pgtap, and am doing so as follows within a test (with _expected and _actual being the names of the returned cursors):
SELECT schema.get_data('_expected', 123);
SELECT schema.get_data_new('_actual', 123);
SELECT results_eq(
    'FETCH ALL FROM _actual',
    'FETCH ALL FROM _expected',
    'get_data_new should return identical results to the legacy version'
);
This works as expected for other functions, but the query in get_data happens to return some json columns, meaning the comparison predictably fails with ERROR: could not identify an equality operator for type json.
I'd rather leave the legacy function untouched so refactoring to jsonb isn't possible. I'm imagining a workaround to be transforming the data before comparison, hypothetically with something like SELECT entity_id, json_column::jsonb FROM (FETCH ALL FROM _actual), but this specific attempt obviously isn't valid.
What would be a suggested approach here? Write a helper function to insert data from the cursors into a couple of temporary tables? I'm hoping there's a cleaner solution I haven't discovered.
Using postgres 11.14, pgtap11

Have solved it by creating a function that loops over the cursor and returns the results as a table. Unfortunately this isn't a generic solution - it only works for cursors returning this specific shape of data.
In this specific case json_column can be implicitly converted to jsonb, so that alone is enough. More importantly, we can now SELECT * FROM cursor_to_table('_actual'), which means we can apply whatever transformations we need to the result.
CREATE OR REPLACE FUNCTION cursor_to_table(_cursor refcursor)
RETURNS TABLE (entity_id bigint, json_column jsonb)
AS $func$
BEGIN
    LOOP
        FETCH _cursor INTO entity_id, json_column;
        EXIT WHEN NOT FOUND;
        RETURN NEXT;
    END LOOP;
    RETURN;
END
$func$ LANGUAGE plpgsql;
SELECT results_eq(
    'SELECT * FROM cursor_to_table(''_actual'')',
    'SELECT * FROM cursor_to_table(''_expected'')',
    'get_data_new should return identical results to the legacy version'
);

Related

Is it possible to write polymorphic Postgres functions using RECORD parameters?

I want to write a PL/pgSQL function that can take records of different types, check the type of record provided, then do something with the record. Example:
CREATE FUNCTION polymorphic_input(arg_rec RECORD) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
BEGIN
    IF pg_typeof(arg_rec)::text = 'information_schema.tables' THEN
        RETURN (arg_rec::information_schema.tables).table_name;
    ELSIF pg_typeof(arg_rec)::text = 'information_schema.columns' THEN
        RETURN (arg_rec::information_schema.columns).column_name;
    ELSE
        RETURN 'unknown';
    END IF;
END;
$plpgsql$;
When you call the function with a row from the information_schema.tables table, it should return the name of the table and it does so when you call it like this:
-- this returns table name "pg_type"
SELECT polymorphic_input((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1));
When you call the function with a row from the information_schema.columns table, it should return the name of the column and it does so when you call it like this:
-- this returns column name "objsubid"
SELECT polymorphic_input((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1));
The problem is you CAN'T call the function twice in a row with different row types. For example, if you call it with a row from information_schema.columns it works; then, when you call it with a row from information_schema.tables, you get an error like this:
type of parameter 1 (information_schema.tables) does not match that when preparing the plan (information_schema.columns)
The words "when preparing the plan" gave me a hint that Postgres is caching the plans, so I figured running DISCARD PLANS; before each call to the function would work, and indeed it does when you run this entire query:
DISCARD PLANS; SELECT polymorphic_input((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1));
DISCARD PLANS; SELECT polymorphic_input((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1));
Running DISCARD PLANS; seems like the nuclear option and would no doubt affect performance in a real-world scenario. After some experimentation, I saw that using the pg_typeof function is what forces the plans to be cached. We can rewrite the function to avoid pg_typeof by adding a parameter that specifies what record type to expect:
CREATE FUNCTION polymorphic_input2(arg_rec RECORD, arg_type text) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
BEGIN
    IF arg_type = 'tables' THEN
        RETURN (arg_rec::information_schema.tables).table_name;
    ELSIF arg_type = 'columns' THEN
        RETURN (arg_rec::information_schema.columns).column_name;
    ELSE
        RETURN 'unknown';
    END IF;
END;
$plpgsql$;
You can then call polymorphic_input2 multiple times in a row with different row types without error as follows:
-- no need for DISCARD PLANS here...these calls work fine.
SELECT polymorphic_input2((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1), 'tables');
SELECT polymorphic_input2((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1), 'columns');
The problem with polymorphic_input2 is that you have to manually give it a hint as to the type of the record to expect. My question: is it possible to implement a polymorphic function that can figure out the type of record passed to it, without the cached plan errors?
The docs mention the plan_cache_mode setting:
Prepared statements (either explicitly prepared or implicitly generated, for example by PL/pgSQL) can be executed using custom or generic plans. Custom plans are made afresh for each execution using its specific set of parameter values, while generic plans do not rely on the parameter values and can be re-used across executions....The allowed values are auto (the default), force_custom_plan and force_generic_plan...
I tried removing the error by running SET plan_cache_mode = force_custom_plan; but that didn't help (which is probably a bug because the docs imply it should force a custom plan in each call, but Postgres is still caching the plan and causing errors). Only DISCARD PLANS worked.
The docs on plan caching seem to recognize this issue and say:
The mutable nature of record variables presents another problem in this connection. When fields of a record variable are used in expressions or statements, the data types of the fields must not change from one call of the function to the next, since each expression will be analyzed using the data type that is present when the expression is first reached. EXECUTE can be used to get around this problem when necessary.
...and a little further down the docs indicate this shouldn't be happening:
Likewise, functions having polymorphic argument types have a separate statement cache for each combination of actual argument types they have been invoked for, so that data type differences do not cause unexpected failures.
This is further confirmed by the docs EXECUTE which say:
Also, there is no plan caching for commands executed via EXECUTE. Instead, the command is always planned each time the statement is run. Thus the command string can be dynamically created within the function to perform actions on different tables and columns.
So I tried another variant that tries to run pg_typeof via EXECUTE:
CREATE FUNCTION polymorphic_input3(arg_rec RECORD) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
DECLARE
    rec_type text;
BEGIN
    EXECUTE 'SELECT pg_typeof($1)' INTO rec_type USING arg_rec;
    IF rec_type = 'information_schema.tables' THEN
        RETURN (arg_rec::information_schema.tables).table_name;
    ELSIF rec_type = 'information_schema.columns' THEN
        RETURN (arg_rec::information_schema.columns).column_name;
    ELSE
        RETURN 'unknown';
    END IF;
END;
$plpgsql$;
...but that still produces the same error as the variant which calls pg_typeof directly.
My question once again: is it possible (in Postgres 14) to implement a polymorphic function that can figure out the type of record passed to it, without the cached plan errors?
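One avenue suggested by the quoted note on polymorphic argument types: declaring the parameter as anyelement instead of RECORD should give each actual argument type its own statement cache, which in principle sidesteps the plan clash. A rough, untested sketch (polymorphic_input4 is a hypothetical name):
CREATE FUNCTION polymorphic_input4(arg_rec anyelement) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
BEGIN
    -- anyelement resolves to the concrete row type at call time, and each
    -- distinct input type gets its own set of cached plans
    IF pg_typeof(arg_rec)::text = 'information_schema.tables' THEN
        RETURN (arg_rec::information_schema.tables).table_name;
    ELSIF pg_typeof(arg_rec)::text = 'information_schema.columns' THEN
        RETURN (arg_rec::information_schema.columns).column_name;
    ELSE
        RETURN 'unknown';
    END IF;
END;
$plpgsql$;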

Capture number of rows affected by dynamic sql?

I am trying to get the result of a QUERY EXECUTE in a plpgsql function, to be able to check how many rows were affected by a dynamic update query. My use case is adding an event (with a custom payload) to a separate table on insert or update to a dynamically set table. Because my event has a custom payload, I have not been able to use a database trigger (e.g. a trigger before insert). As a simplified example, assume I have this table:
CREATE TABLE users (user_id text primary key, name text)
Here is my simplified events table:
CREATE TABLE events(event_id text primary key, payload json)
Here is my simplified function:
CREATE OR REPLACE FUNCTION my_function(_rowtype anyelement, q text, payload jsonb)
RETURNS SETOF anyelement AS
$func$
DECLARE
    event_id text;
BEGIN
    SELECT jsonb_object_field_text (payload, 'id')::text INTO STRICT event_id;
    execute format('insert into events(event_id, payload) values ($1, $2)') using event_id, payload;
    RETURN QUERY EXECUTE format('%s', q);
END
$func$ LANGUAGE plpgsql;
The goal is to have this work exactly the same as if someone had run these statements in a transaction. In pseudocode, for insert:
BEGIN
insert into events(id, payload) values($1, $2)
insert into users(columns) values(<any values>)
COMMIT
and similarly for update:
BEGIN
insert into events(id, payload) values($1, $2)
result, error := query(`update users set name = 'hello' where id = 'Not Exists Thus No Rows Modified'`);
if result.rowsAffected() == 0 {
ROLLBACK
}
COMMIT
The function my_function almost works except for one edge case: when an update actually doesn't affect any rows.
For example, this works:
select * from my_function(NULL::users,
    'insert into users(user_id, name) values(''u1'', ''a2'') returning *',
    payload => '{"id": "e1", "custom": "s1", "field": "2019-10-12T07:20:50.52Z"}')
As expected, after this runs a row is created in both the users table and the events table.
What fails is the following:
select * from my_function(NULL::users,
    'update users set name = ''hello'' where user_id = ''NotExists'' returning *',
    payload => '{"id": "e2", "custom": "s3", "field": "2019-10-12T07:20:50.52Z"}')
Here, a row is created in the events table (my goal is that it should not be created).
I know this approach is not elegant, and I know this is vulnerable to SQL injection. I'd love suggestions on better ways to solve this (including scrapping what we're doing now). But to answer the question directly, I'm looking to store the result of QUERY EXECUTE, check whether any rows were affected, and raise an error so that there is never a case where a row is created in the events table without a real corresponding change in the users table. The users table is just an example; in general it could be any dynamically set table.
RETURN QUERY doesn't need to be the last statement in the function; it only says: "the results of this query are part of the result set".
So you can issue the RETURN QUERY, check FOUND and act accordingly. Here is your function modified to work this way:
CREATE OR REPLACE FUNCTION public.my_function(_rowtype anyelement, q text, payload jsonb)
RETURNS SETOF anyelement
LANGUAGE plpgsql
AS $function$
DECLARE
    event_id text;
BEGIN
    SELECT jsonb_object_field_text (payload, 'id')::text INTO STRICT event_id;
    RETURN QUERY EXECUTE format('%s', q);
    IF FOUND THEN
        execute format('insert into events(event_id, payload) values ($1, $2)') using event_id, payload;
    END IF;
    RETURN;
END
$function$;
P.S.: Maybe you can also solve your problem with FOR EACH STATEMENT triggers using transition tables (REFERENCING OLD TABLE / NEW TABLE, available since v10, https://www.postgresql.org/docs/10/sql-createtrigger.html)
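To illustrate that statement-trigger idea: the sketch below is hypothetical and only covers INSERT (UPDATE would need its own trigger), and it sidesteps the custom-payload problem by reading the payload from a session setting, which may or may not fit the original use case.
-- Trigger function: log one event per statement, but only if rows actually arrived.
CREATE FUNCTION log_users_insert() RETURNS trigger LANGUAGE plpgsql AS
$$
DECLARE
    p json := current_setting('app.event_payload', true)::json;  -- payload set by the application
BEGIN
    IF EXISTS (SELECT FROM new_rows) THEN
        INSERT INTO events(event_id, payload) VALUES (p->>'id', p);
    END IF;
    RETURN NULL;  -- return value is ignored for statement-level AFTER triggers
END
$$;

CREATE TRIGGER users_insert_event
AFTER INSERT ON users
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE PROCEDURE log_users_insert();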
CREATE OR REPLACE FUNCTION my_function(_rowtype anyelement, q text, payload jsonb)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
    RETURN QUERY EXECUTE q;

    IF NOT FOUND THEN
        RETURN;  -- nothing happened yet, we can exit silently.
        -- Or you WANT an error for this case. Then do this instead:
        -- RAISE EXCEPTION 'Query passed in parameter "q" did not affect any rows. Doing nothing!';
    END IF;

    INSERT INTO events(event_id, payload)
    VALUES (payload->>'id', payload);
END
$func$;
As has been commented, RETURN QUERY does not return from the function. The manual:
RETURN NEXT and RETURN QUERY do not actually return from the
function — they simply append zero or more rows to the function's
result set. Execution then continues with the next statement in the
PL/pgSQL function. As successive RETURN NEXT or RETURN QUERY
commands are executed, the result set is built up. A final RETURN,
which should have no argument, causes control to exit the function (or
you can just let control reach the end of the function).
There's a code example for your case exactly at the bottom of that chapter in the manual. From me, actually. Originating here:
FUNCTION syntax error
It was suggested to use GET DIAGNOSTICS instead of the simpler FOUND. It's true that EXECUTE does not set the state of FOUND. But RETURN QUERY does. So keep using the simpler FOUND. Related:
Dynamic SQL (EXECUTE) as condition for IF statement
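For completeness: if the statement were run with a plain EXECUTE instead of RETURN QUERY, FOUND would not be set and GET DIAGNOSTICS would be the tool for the row count. A small illustrative fragment, assuming the users table from the question:
DO $$
DECLARE
    _rows bigint;
BEGIN
    EXECUTE 'UPDATE users SET name = name WHERE false';  -- dynamic statement touching no rows
    GET DIAGNOSTICS _rows = ROW_COUNT;
    RAISE NOTICE 'rows affected: %', _rows;               -- prints 0
END
$$;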
You have format() in your original twice. And while that's typically very useful for dynamic SQL, it's useless in your case. EXECUTE format('%s', q) is exactly the same as just EXECUTE q, with added cost. Both are open doors for SQL injection when passing user input.
Since there is a good chance that the transaction will be rolled back, start with the critical step and do the rest later, to avoid wasting work. So I moved executing q to the top, assuming it does not depend on the "payload" row, which is now inserted later.
Also, INSERT INTO events can be plain SQL. Nothing dynamic there. No need for format() or EXECUTE.
Finally, I'm assuming your jsonb_object_field_text (payload, 'id')::text is just a fancy way of saying payload->>'id'; then there's no need for an additional variable and another SELECT INTO.
Warning against SQL injection
Converting user input (parameter q in the example) to code to execute dynamically is the most direct SQL injection vulnerability of all. I wouldn't want to be caught in my underwear doing that.
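As an aside on the format() point above: where format() does earn its keep is when identifiers or literals have to be spliced into dynamic SQL, since %I and %L quote them safely. A hypothetical fragment (the variables are made up, and a table name like this should still come from a trusted whitelist rather than raw user input):
DO $$
DECLARE
    _table text := 'users';   -- identifier, quoted via %I
    _name  text := 'hello';   -- literals, quoted via %L
    _id    text := 'u1';
BEGIN
    EXECUTE format('UPDATE %I SET name = %L WHERE user_id = %L', _table, _name, _id);
END
$$;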

plpgsql : how to reference variable in sql statement

I am rather new to PL/pgSQL and don't know how to reference a variable in a SELECT statement.
In this following function the SELECT INTO always returns NULL:
$body$
DECLARE
    r RECORD;
    c CURSOR FOR select name from t_people;
    nb_bills integer;
BEGIN
    OPEN c;
    LOOP
        FETCH c INTO r;
        EXIT WHEN NOT FOUND;
        RAISE NOTICE 'name found: %', r.name;
        SELECT count(bill_id) INTO nb_bills from t_bills where name = r.name;
    END LOOP;
END;
$body$
RAISE NOTICE allows me to verify that my CURSOR is working well: names are properly retrieved, but for some reason still unknown to me, not properly passed to the SELECT INTO statement.
For debugging purposes, I tried to replace the variable in SELECT INTO with a constant value and it worked:
SELECT count( bill_id) INTO nb_bills from t_bills where name = 'joe';
I don't know how to reference r.name in the SELECT INTO statement.
I tried r.name; I tried creating another variable containing quotes; it always returns NULL.
I am stuck. If anyone knows ...
the SELECT INTO always returns NULL
not properly passed to the SELECT INTO statement.
it is always returning NULL.
None of this makes sense.
SELECT statements do not return anything by themselves in PL/pgSQL. You have to either assign results to variables or explicitly return results with one of the available RETURN variants.
SELECT INTO is only used for variable assignment and does not return anything, either. Not to be confused with SELECT INTO in plain SQL - which is generally discouraged:
Combine two tables into a new one so that select rows from the other one are ignored
It's not clear what's supposed to be returned in your example. You did not disclose the return type of the function and you did not actually return anything.
Start by reading the chapter Returning From a Function in the manual.
Here are some related answers with examples:
Can I make a plpgsql function return an integer without using a variable?
Return a query from a function?
How to return multiple rows from PL/pgSQL function?
Return setof record (virtual table) from function
And there may be naming conflicts with parameter names. We can't tell unless you post the complete function definition.
Better approach
That said, explicit cursors are only actually needed on rare occasions. Typically, the implicit cursor of a FOR loop is simpler to handle and cheaper:
Truncating all tables in a Postgres database
And most of the time you don't even need any cursors or loops at all. Set-based solutions are typically simpler and faster.
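To make that concrete against the t_people / t_bills tables from the question (an untested sketch): first the original loop rewritten with an implicit FOR cursor, then a set-based query that drops the loop entirely.
DO $$
DECLARE
    r record;
    nb_bills integer;
BEGIN
    FOR r IN SELECT name FROM t_people LOOP
        SELECT count(b.bill_id) INTO nb_bills
        FROM   t_bills b
        WHERE  b.name = r.name;          -- qualified to avoid the naming conflict
        RAISE NOTICE 'name: %, bills: %', r.name, nb_bills;
    END LOOP;
END
$$;

-- Set-based alternative: one query instead of one query per person
SELECT p.name, count(b.bill_id) AS nb_bills
FROM   t_people p
LEFT   JOIN t_bills b ON b.name = p.name
GROUP  BY p.name;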

Print ASCII-art formatted SETOF records from inside a PL/pgSQL function

I would love to exploit the SQL output formatting of PostgreSQL inside my PL/pgSQL functions, but I'm starting to feel I have to give up the idea.
I have my PL/pgSQL function query_result:
CREATE OR REPLACE FUNCTION query_result(
    this_query text
) RETURNS SETOF record AS
$$
BEGIN
    RETURN QUERY EXECUTE this_query;
END;
$$ LANGUAGE plpgsql;
..merrily returning a SETOF records from an input text query, and which I can use for my SQL scripting with dynamic queries:
mydb=# SELECT * FROM query_result('SELECT ' || :MYVAR || ' FROM Alice') AS t (id int);
id
----
1
2
3
So my hope was to find a way to deliver this same nicely formatted output from inside a PL/pgSQL function instead, but RAISE does not support SETOF types, and there's no magic predefined cast from SETOF records to text (I know I could create my own CAST..)
If I create a dummy print_result function:
CREATE OR REPLACE FUNCTION print_result(
    this_query text
) RETURNS void AS
$$
BEGIN
    SELECT query_result(this_query);
END;
$$ LANGUAGE plpgsql;
..I cannot print the formatted output:
mydb=# SELECT print_result('SELECT ' || :MYVAR || ' FROM Alice');
ERROR: set-valued function called in context that cannot accept a set
...
Thanks for any suggestion (which preferably works with PostgreSQL 8.4).
Ok, to do anything with your result set in print_result you'll have to loop over it. That'll look something like this -
Here result_record is defined as a record variable. For the sake of explanation, we'll also assume that you have a formatted_results variable that is defined as text and defaulted to a blank string to hold the formatted results.
FOR result_record IN SELECT * FROM query_result(this_query) AS t (id int) LOOP
    -- With all this, you can do something like this
    formatted_results := formatted_results ||','|| result_record.id;
END LOOP;
RETURN formatted_results;
So, if you change print_results to return text, declare the variables as I've described and add this in, your function will return a comma-separated list of all your results (with an extra comma at the end, I'm sure you can make use of PostgreSQL's string functions to trim that). I'm not sure this is exactly what you want, but this should give you a good idea about how to manipulate your result set. You can get more information here about control structures, which should let you do pretty much whatever you want.
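Pulled together, a text-returning version of print_result along those lines might look like this (a sketch only, assuming the query_result function from the question and a single integer id column; the earlier void-returning print_result would have to be dropped first, since CREATE OR REPLACE cannot change the return type):
CREATE OR REPLACE FUNCTION print_result(
    this_query text
) RETURNS text AS
$$
DECLARE
    result_record record;
    formatted_results text := '';
BEGIN
    FOR result_record IN SELECT * FROM query_result(this_query) AS t (id int) LOOP
        formatted_results := formatted_results || result_record.id || ',';
    END LOOP;
    -- trim the trailing comma mentioned above
    RETURN trim(trailing ',' from formatted_results);
END;
$$ LANGUAGE plpgsql;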
EDIT TO ANSWER THE REAL QUESTION:
The ability to format data tuples as readable text is a feature of the psql client, not the PostgreSQL server. To make this feature available in the server would require extracting relevant code or modules from the psql utility and recompiling them as a database function. This seems possible (and it is also possible that someone has already done this), but I am not familiar enough with the process to provide a good description of how to do that. Most likely, the best solution for formatting query results as text will be to make use of PostgreSQL's string formatting functions to implement the features you need for your application.

Accessing columns inside a record using table alias in Postgresql

I'm trying to iterate over the result of a query using a record data type. Nevertheless, if I try to access one column using the table alias defined in the query, I get the following error:
ERROR: schema "inv_row" does not exist
CONTEXT: SQL command "SELECT inv_row.s.processor <> inv_row.d.processor"
PL/pgSQL function "teste" line 7 at IF
Here is the code that throws this error:
CREATE OR REPLACE FUNCTION teste() returns void as $$
DECLARE
    inv_row record;
BEGIN
    FOR inv_row in SELECT * FROM sa_inventory s LEFT JOIN dim_inventory d ON s.macaddr = d.macaddr LOOP
        IF inv_row.s.processor <> inv_row.d.processor THEN
            <do something>;
        END IF;
    END LOOP;
END;
$$ language plpgsql;
Is there another way to access a column of a particular table inside a record data type?
Fortunately the answer here is relatively simple. You have to use parentheses to indicate tuples:
IF (inv_row.s).processor <> (inv_row.d).processor THEN
This is because SQL assigns meaning by namespace depth, so without the parentheses PostgreSQL cannot safely determine what is meant. inv_row.s.processor means the processor column of the s table in the inv_row schema. (inv_row.s).processor, however, means: take the s column of the inv_row row, treat it as a tuple, and take the processor column of that.
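For (inv_row.s) to be a field of the record at all, the loop query also has to select the whole-row aliases themselves rather than expanding them with *. Roughly like this, as an untested sketch against the tables from the question:
CREATE OR REPLACE FUNCTION teste() RETURNS void AS $$
DECLARE
    inv_row record;
BEGIN
    FOR inv_row IN
        SELECT s, d    -- whole-row values, so the record gets fields s and d
        FROM   sa_inventory s
        LEFT   JOIN dim_inventory d ON s.macaddr = d.macaddr
    LOOP
        IF (inv_row.s).processor <> (inv_row.d).processor THEN
            RAISE NOTICE 'processor differs for %', (inv_row.s).macaddr;  -- placeholder for <do something>
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;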