Pentaho Data Integration Input / Output Bit Type Error - postgresql

I am using Pentaho Data Integration for numerous projects at work. We predominantly use Postgres for our databases. One of our older tables has two columns of type bit(1) that store 0 for false and 1 for true.
My task is to synchronize a production table with a copy in our development environment. I am reading the data in using a Table Input step and immediately trying to do an Insert/Update. However, it fails because of the conversion to Boolean by PDI. I updated the query to cast the values to integers to retain the 0 and 1, but when I run it again the transformation fails because an integer cannot be stored in a bit column.
I have spent several days trying different things, like using the JavaScript step to convert to a bit, but I have not been able to successfully read in a bit type and use the Insert/Update step to store the data. I also do not believe the Insert/Update step is capable of modifying the SQL that defines the data type for the column.
The database connection is set up using:
Connection Type: PostgreSQL
Access: Native (JDBC)
Supports the boolean data type: true
Quote all in database: true
Note: Altering the table to change the data type is not an option at this point in time. Too many applications currently depend on this table, so altering it in this way could cause undesirable effects.
Any help would be appreciated. Thank you.

You can create a cast (for example, from character varying to bit) in your destination database with the AS ASSIGNMENT option, which allows the cast to be applied automatically during inserts.
http://www.postgresql.org/docs/9.3/static/sql-createcast.html
Here is some proof-of-concept for you:
CREATE FUNCTION cast_char_to_bit (arg CHARACTER VARYING)
RETURNS BIT(1) AS
$$
SELECT
    CASE WHEN arg = '1' THEN B'1'
         WHEN arg = '0' THEN B'0'
         ELSE NULL
    END
$$
LANGUAGE SQL;

CREATE CAST (CHARACTER VARYING AS BIT(1))
WITH FUNCTION cast_char_to_bit(CHARACTER VARYING)
AS ASSIGNMENT;
Now you should be able to insert/update single-character strings into the bit(1) column. However, you will need to cast your input column to character varying/text, so that it is read as a String in the Table Input step and sent as CHARACTER VARYING by the Insert/Update step.
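As a quick sanity check, here is a minimal sketch of the cast doing its work on assignment (the table and column names are made up for illustration):
-- Illustrative table; the real target has the existing bit(1) columns.
CREATE TABLE bit_demo (flag BIT(1));
-- Without the cast this fails with an error along the lines of
-- "column "flag" is of type bit but expression is of type character varying";
-- with the AS ASSIGNMENT cast in place the assignment succeeds.
INSERT INTO bit_demo (flag) SELECT '1'::CHARACTER VARYING;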
You could probably create the cast using one of the cast functions already defined in Postgres (see the pg_cast, pg_type and pg_proc catalogs, joined by oid), but I haven't managed to do this, unfortunately.
Edit 1:
Sorry for the previous solution. Adding a cast from boolean to bit looks much more reasonable: you will not even need to cast the data in your Table Input step.
CREATE FUNCTION cast_bool_to_bit (arg boolean)
RETURNS BIT(1) AS
$$
SELECT
    CASE WHEN arg THEN B'1'
         WHEN NOT arg THEN B'0'
         ELSE NULL
    END
$$
LANGUAGE SQL;

CREATE CAST (BOOLEAN AS BIT(1))
WITH FUNCTION cast_bool_to_bit(boolean)
AS ASSIGNMENT;
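A minimal check of the boolean-to-bit cast, again with a made-up table name:
CREATE TABLE bool_bit_demo (flag BIT(1));
-- The AS ASSIGNMENT cast converts the boolean automatically on insert and update,
-- which is what the Insert/Update step relies on.
INSERT INTO bool_bit_demo (flag) SELECT TRUE;
UPDATE bool_bit_demo SET flag = FALSE;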

I solved this by writing out the Postgres insert SQL (with B'1' and B'0' for the bit values) in a previous step and using "Execute row SQL Script" at the end to run each insert as individual SQL statements.
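For reference, the statements generated in that previous step look roughly like this (table and column names are placeholders), one statement per incoming row:
-- B'...' is PostgreSQL's bit-string literal syntax.
INSERT INTO target_table (id, some_flag, other_flag)
VALUES (42, B'1', B'0');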

Related

Can I use to_char() and make_date() in postgreSQL table definition?

I'm working on a proof of concept to migrate an on-prem SQL Server database to Amazon Aurora for PostgreSQL. Amazon's Schema Conversion Tool struggled to translate the SQL Server code for the creation of a table on this column:
[DOB] AS (CONVERT([varchar],datefromparts([DOB_year],[DOB_month],[DOB_day]),(120))) PERSISTED,
as the CONVERT function is unsupported in Postgres.
The best translation I can come up with is:
dob varchar(30) GENERATED ALWAYS AS (to_char((make_date(dob_year, dob_month, dob_day))::timestamp, 'YYYY-MM-DD HH24:MI:SS')) STORED,
but neither the SCT nor pgAdmin4 recognises to_char() and make_date() as functions. dob_day, dob_month and dob_year are all column names with a data type of integer. I'm new to all this, but another column definition uses other functions, e.g. replace() and right(), successfully, so I'm confused as to why this isn't working.
When I tried to run the code in pgAdmin I got this error:
ERROR: generation expression is not immutable
SQL state: 42P17
Thanks
to_char() is not marked as immutable, even though in your case it would be. But there are format masks that are not immutable if, e.g., time zones or different locales are involved.
If you really want to (or are forced to) convert day, month and year into a formatted string (rather than a proper date, which would be the correct thing to do), then you can only achieve this with a custom function:
create function create_string_date(p_year int, p_month int, p_day int)
returns text
as
$$
select to_char(make_date(p_year, p_month, p_day), 'yyyy-mm-dd hh24:mi:ss');
$$
language sql
immutable;
Marking the function as immutable isn't cheating, because we know that with the given input and format string this is indeed immutable.
dob text generated always as (create_string_date(dob_year, dob_month, dob_day)) stored
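A minimal end-to-end sketch of the pieces fitting together, under the assumption of a made-up table name (dob_demo):
-- Hypothetical table using the immutable helper in a generated column.
CREATE TABLE dob_demo (
    dob_year  integer,
    dob_month integer,
    dob_day   integer,
    dob text GENERATED ALWAYS AS (create_string_date(dob_year, dob_month, dob_day)) STORED
);
INSERT INTO dob_demo (dob_year, dob_month, dob_day) VALUES (1990, 7, 14);
SELECT dob FROM dob_demo;   -- '1990-07-14 00:00:00'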

Postgres: getting "... is out of range for type integer" when using NULLIF

For context, this issue occurred in a Go program I am writing using the default postgres database driver.
I have been building a service to talk to a postgres database which has a table similar to the one listed below:
CREATE TABLE object (
id SERIAL PRIMARY KEY NOT NULL,
name VARCHAR(255) UNIQUE,
some_other_id BIGINT UNIQUE
...
);
I have created some endpoints for this item including an "Install" endpoint which effectively acts as an upsert function like so:
INSERT INTO object (name, some_other_id)
VALUES ($1, $2)
ON CONFLICT (name) DO UPDATE SET
some_other_id = COALESCE(NULLIF($2, 0), object.some_other_id)
I also have an "Update" endpoint with an underlying query like so:
UPDATE object
SET some_other_id = COALESCE(NULLIF($2, 0), object.some_other_id)
WHERE name = $1
The problem:
Whenever I run the update query I always run into the error, referencing the field "some_other_id":
pq: value "1010101010144" is out of range for type integer
However, this error never occurs on the "upsert" version of the query, even when the row already exists in the database (when it has to evaluate the COALESCE expression). I have been able to prevent this error by updating the COALESCE expression as follows:
COALESCE(NULLIF($2, CAST(0 AS BIGINT)), object.some_other_id)
But as it never occurs with the first query, I wondered whether this inconsistency comes from me doing something wrong or something I don't understand. Also, what is the best practice here: should I be casting all values?
I am definitely passing in a 64 bit integer to the query for "some_other_id", and the first query works with the Go implementation even without the explicit type cast.
If any more information (or Go implementation) is required then please let me know, many thanks in advance! (:
Edit:
To eliminate confusion, the queries are being executed directly in Go code like so:
res, err := s.db.ExecContext(ctx, `UPDATE object SET some_other_id = COALESCE(NULLIF($2, 0), object.some_other_id) WHERE name = $1`,
"a name",
1010101010144,
)
Both queries are executed in exactly the same way.
Edit: Also corrected parameter (from $51 to $2) in my current workaround.
I would also like to take this opportunity to note that the query does work with my proposed fix, which suggests that the issue is me confusing Postgres with types in the NULLIF expression? There is no stored procedure asking for an INTEGER argument in between my code and the database, at least not one that I have written.
This has to do with how the postgres parser resolves types for the parameters. I don't know how exactly it's implemented, but given the observed behaviour, I would assume that the INSERT query doesn't fail because it is clear from (name,some_other_id) VALUES ($1,$2) that the $2 parameter should have the same type as the target some_other_id column, which is of type int8. This type information is then also used in the NULLIF expression of the DO UPDATE SET part of the query.
You can also test this assumption by using (name) VALUES ($1) in the INSERT and you'll see that the NULLIF expression in DO UPDATE SET will then fail the same way as it does in the UPDATE query.
So the UPDATE query fails because there is not enough context for the parser to infer the accurate type of the $2 parameter. The "closest" thing that the parser can use to infer the type of $2 is the NULLIF call expression, specifically it uses the type of the second argument of the call expression, i.e. 0, which is of type int4, and it then uses that type information for the first argument, i.e. $2.
To avoid this issue, you should use an explicit type cast with any parameter where the type cannot be inferred accurately. i.e. use NULLIF($2::int8, 0).
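One way to see that inference at work (assuming psql access and the object table from the question) is to PREPARE the statement without declaring parameter types and look at what the parser picked:
PREPARE upd AS
UPDATE object
SET some_other_id = COALESCE(NULLIF($2, 0), object.some_other_id)
WHERE name = $1;
-- parameter_types will likely show something like {text,integer}:
-- $2 was inferred from the literal 0 inside NULLIF, not from the bigint target column.
SELECT name, parameter_types FROM pg_prepared_statements WHERE name = 'upd';
With the explicit cast (NULLIF($2::int8, 0)) the same check reports bigint for $2.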
COALESCE(NULLIF($51, CAST(0 AS BIGINT)), object.some_other_id)
Fifty-one? Really?
pq: value "1010101010144" is out of range for type integer
Pay attention, the data type in the error message is an integer, not bigint.
I think the reason for the error lies outside the code shown. So I take out my magic crystal ball and make a pass with my hands.
an "Install" endpoint which effectively acts as an upsert function like so
I also have an "Update" endpoint
By "endpoint", do you mean a PostgreSQL function (stored procedure)? I think yes.
Also, $1 and $2 look like PostgreSQL function arguments.
The magic crystal ball says: you have two PostgreSQL functions with different argument data types:
"Install" endpoint has $2 function argument as a bigint data type. It looks like CREATE FUNCTION Install(VARCHAR(255), bigint)
"Update" endpoint has $2 function argument as an integer data type, not bigint. It looks like CREATE FUNCTION Update(VARCHAR(255), integer).
Finally, I would rewrite your condition more readably:
UPDATE object
SET some_other_id =
CASE
WHEN $2 = 0 THEN object.some_other_id
ELSE $2
END
WHERE name = $1

Stored procedure in db2 with cursor return type

I am developing a stored procedure (SP) in DB2 which will return some data in the form of an output cursor, but the field lengths of different fields may vary. I am facing issues because I am not able to get the SP to compile for this requirement. Below is the code for reference:
create table employee(id bigint, first_name varchar(128), last_name varchar(128));

create table employee_designation(id bigint, emp_id bigint, designation varchar(128));

create type empRowType as row(first_name varchar(128), last_name varchar(128), designation varchar(128));

create type empCursorType as empRowType cursor;

create procedure emp_designation_lookup(in p_last_name varchar(128), out emp_rec empCursorType)
result sets 0
language SQL
begin
    set emp_rec = cursor for
        select a.first_name, a.last_name, b.designation
        from employee a, employee_designation b
        where a.id = b.EMP_ID
        and a.LAST_NAME = p_last_name;
end;
The above SP compiles and returns the result as intended. However, if I change the row definition as below:
create type empRowType as row(first_name varchar(120),last_name varchar(128),
designation varchar(128));
On recompiling the SP, I get the following error
BMS sample -- [IBM][CLI Driver][DB2/NT64] SQL0408N A value is not compatible with the
data type of its assignment target. Target name is "EMP_REC". LINE NUMBER=5. SQLSTATE=42821
The error occurs because first_name as defined in the cursor does not have the same length as in the employee table (the cursor has 120 whereas the table has 128).
However, for my actual work I need the return values computed based on some logic, and hence the lengths specified in the cursor will differ from what is in the table. I also have some new columns in the cursor which are not related to the table's columns (for example, the bonus amount, or whether the employee should be promoted, etc.).
I want to know if there is indeed some solution to this scenario specific to DB2. I am new to DB2 and am using version 10.5.7. I have also explored multiple articles in the IBM docs but was not able to find an exact resolution. Any help or pointers would be greatly appreciated.
When you use a strongly typed cursor, then any assignment involving that cursor must exactly match the relevant type. Otherwise the compiler will throw an exception, which is your symptom.
Db2 SQL PL also supports weak cursors, and an SQL PL procedure output parameter type can be a weak cursor. This means that a stored procedure declaration can use ...OUT p_cur CURSOR (so there is no preassigned user defined type linked to that cursor) , and then assign that output parameter from different queries ( set p_cur = CURSOR FOR SELECT ... ) . In my case the caller is always SQL (not jdbc), but you might experiment with jdbc as IBM gives an example in the Db2-LUW v11.5 documentation.
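A hedged, untested sketch of what that could look like against the tables from the question (the procedure name and the extra computed column are made up):
CREATE OR REPLACE PROCEDURE emp_designation_lookup_weak
  (IN p_last_name VARCHAR(128), OUT emp_rec CURSOR)
LANGUAGE SQL
BEGIN
  -- No user-defined row/cursor type, so the SELECT list is free to return
  -- computed values whose lengths and columns do not match the base tables.
  SET emp_rec = CURSOR FOR
    SELECT a.first_name,
           a.last_name,
           b.designation,
           CASE WHEN b.designation = 'MANAGER' THEN 'Y' ELSE 'N' END AS promote_flag
    FROM employee a
    JOIN employee_designation b ON a.id = b.emp_id
    WHERE a.last_name = p_last_name;
END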
Most people use simple result sets (instead of returned cursors) to harvest the output from queries in stored procedures. These result sets are consumable by all kinds of client applications (JDBC, ODBC, CLI) and languages that use those interfaces (Java, .NET, Python, PHP, Perl, JavaScript, command line/scripting, etc.). So simple result sets offer more general-purpose usability than returned cursor parameters.
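For comparison, a sketch of the plain result-set variant (again untested, names made up), which needs no cursor type at all:
CREATE OR REPLACE PROCEDURE emp_designation_lookup_rs (IN p_last_name VARCHAR(128))
DYNAMIC RESULT SETS 1
LANGUAGE SQL
BEGIN
  -- The open cursor declared WITH RETURN becomes the procedure's result set.
  DECLARE c1 CURSOR WITH RETURN TO CALLER FOR
    SELECT a.first_name, a.last_name, b.designation
    FROM employee a
    JOIN employee_designation b ON a.id = b.emp_id
    WHERE a.last_name = p_last_name;
  OPEN c1;
END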
IBM publishes various Db2 samples in different places (on github, in the samples directory-tree of your Db2 server instance directory, in the Knowledge Center etc.).

Executing queries dynamically in PL/pgSQL

I have found solutions (I think) to the problem I'm about to ask for on Oracle and SQL Server, but can't seem to translate this into a Postgres solution. I am using Postgres 9.3.6.
The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by having queries run for each column so as to find out, say... min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.
With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:
SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '
FROM information_schema.columns
WHERE table_name = 'salesdata'
The advantage being that the db will generate the code regardless of the number of columns.
Now there are a myriad of places I had in mind for storing these queries, either a variable of some sort or a table column, the idea being to then have these queries executed.
I thought of storing the generated queries in a variable and then executing them using the EXECUTE (or EXECUTE IMMEDIATE) statement, which is the approach employed here (see right pane), but Postgres won't let me declare a variable outside a function, and I've been scratching my head over how this would fit together and whether that's even the direction to follow; perhaps there's something simpler.
Would you have any pointers? I'm currently trying something like the following, inspired by this other question, but have no idea whether I'm headed in the right direction:
CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
dyn_sql text;
BEGIN
dyn_sql := SELECT 'SELECT min('||column_name||') from salesdata'
FROM information_schema.columns
WHERE table_name = 'salesdata';
execute dyn_sql
END
$$ LANGUAGE PLPGSQL;
System statistics
Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:
This view allows access only to rows of pg_statistic that correspond
to tables the user has permission to read, and therefore it is safe to
allow public read access to this view.
It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.
-- ANALYZE tbl; -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND schemaname = 'public';
Generic dynamic plpgsql function
You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.
This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.
CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
, string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
, pg_typeof(_tbl)::text)
FROM pg_attribute
WHERE attrelid = pg_typeof(_tbl)::text::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
);
END
$func$;
Call (important!):
SELECT * FROM f_min_of(NULL::tbl); -- tbl being the table name
db<>fiddle here
Old sqlfiddle
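For the salesdata table from the question, that call would presumably read:
-- NULL::salesdata passes a NULL of that table's row type; the polymorphic
-- parameter uses it to resolve the actual return type.
SELECT * FROM f_min_of(NULL::salesdata);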
You need to understand these concepts:
Dynamic SQL in plpgsql with EXECUTE
Polymorphic types
Row types and table types in Postgres
How to defend against SQL injection
Aggregate functions
System catalogs
Related answer with detailed explanation:
Table name as a PostgreSQL function parameter
Refactor a PL/pgSQL function to return the output of various SELECT queries
Postgres data type cast
How to set value of composite variable field using dynamic SQL
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
Generate series of dates - using date type as input
Special difficulty with type mismatch
I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.
However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.
That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure, nothing has changed in this area.
That's where this construct comes in as workaround:
SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;
By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.
This might fail for some aggregate functions. sum() returns numeric for sum(bigint_column) to accommodate a sum overflowing the base data type. Casting back to bigint might fail ...
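A quick, throwaway way to see both effects is to ask Postgres for the aggregates' return types directly:
-- min(varchar) comes back as text, sum(bigint) as numeric.
SELECT pg_typeof(min(v)) AS min_type, pg_typeof(sum(b)) AS sum_type
FROM (VALUES ('a'::varchar, 1::bigint)) AS t(v, b);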
@Erwin Brandstetter, many thanks for the extensive answer. pg_stats does indeed provide a few things, but what I really need to draw a complete profile is a variety of things: min and max values, counts, counts of nulls, mean, etc., so a bunch of queries have to be run for each column, some with GROUP BY and such.
Also, thanks for highlighting the importance of data types; I was sort of expecting this to throw a spanner in the works at some point. My main concern was how to automate the query generation and its execution, this last bit especially.
I have tried the function you provide (I probably will need to start learning some PL/pgSQL) but get an error at the SELECT (t::tbl):
ERROR: type "tbl" does not exist
BTW, what is the (t::abc) notation referred to as? In Python this would be a list slice, but that's probably not the case in PL/pgSQL.

Trigger to check valid input

I am inserting a lot of measurement data from different sources into a Postgres database. The data are a measured value and an uncertainty (and a lot of auxiliary data). The problem is that in some cases I get an absolute error value, e.g. 123 +/- 33; in other cases I get a relative error as a percentage of the measured value, e.g. 123 +/- 10%. I would like to store all the measurements with absolute error, i.e. the latter should be stored as 123 +/- 12.3 (at this point, I don't care too much about the number of significant digits).
My idea is to use a trigger to do this. Basically, if the error is numeric, store it as is; if it is non-numeric, check if the last character is '%', and in that case multiply it by the measured value, divide by 100 and store the resulting value. I got an isnumeric function from here: isnumeric() with PostgreSQL, which works fine. But when I try to make this into a trigger, it seems as if the input is checked for validity even before the trigger fires, so the insert is aborted before I get any chance to do anything with the values.
My trigger function (I still need to do the calculation; just setting the error to 0 here):
create function my_trigger_function()
returns trigger as'
begin
    if not isnumeric(new.err) then
        new.err = 0;
    end if;
    return new;
end' language 'plpgsql';
then I connect it to the table:
create trigger test_trigger
before insert on test
for each row
execute procedure my_trigger_function();
Doing this, I would expect to get val=123 and err=0 for the following insert
insert into test(val,err) values(123,'10%');
but the insert fails with "invalid input syntax for type numeric", which must then be raised before my trigger gets any chance to see the data (or I have misunderstood something basic). Is it possible to make new.err data-type agnostic, can I run the trigger even earlier, or is what I want to do just plain impossible?
It's not possible with a trigger because the SQL parser fails before.
When the trigger is launched, the NEW.* columns already have their definitive types matching the destination columns.
The closest alternative is to provide a function converting from text to numeric implementing your custom syntax rules and apply it in the VALUES clause:
insert into test(val,err) values(123, custom_convert('10%'));
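A sketch of what such a conversion function could look like. The two-argument form is my assumption: turning '10%' into an absolute error needs the measured value as well, which the single-argument call above cannot see (named parameters in SQL function bodies need PostgreSQL 9.2 or later):
-- Hypothetical helper: '10%' with a value of 123 yields 12.3; plain numbers pass through.
CREATE FUNCTION custom_convert(p_err text, p_val numeric)
RETURNS numeric AS $$
  SELECT CASE
           WHEN right(p_err, 1) = '%' THEN p_val * left(p_err, -1)::numeric / 100
           ELSE p_err::numeric
         END;
$$ LANGUAGE sql;
-- e.g. insert into test(val, err) values (123, custom_convert('10%', 123));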
Daniel answered my original question, and I found out I had to think differently. His proposal for how to do it may work for others, but because my system interfaces with the database by fetching table and column names directly from the database, it would not work well for me.
Instead I added a boolean field relerr to the measurement table
alter table measure add relerr boolean default false;
Then I made a trigger that checks whether relerr is true, indicating that I am trying to store a relative error; if so, it recalculates the error column (called prec, for precision):
CREATE FUNCTION calc_fromrel_error()
RETURNS trigger as'
BEGIN
    IF NEW.relerr THEN
        NEW.prec = NEW.prec * NEW.value / 100;
        NEW.relerr = FALSE;
    END IF;
    RETURN NEW;
END' language 'plpgsql';
and then
create trigger meas_calc_relerr_trigger
before insert on measure
for each row
execute procedure calc_fromrel_error();
voila, by doing a
INSERT into measure (value,prec,relerr) values(220,10,true);
I get the table populated with 220, 22, false. Inserted values should normally never be updated; if that for some strange reason should happen, I will be able to recalculate the prec column manually.