I am slowly working through a feature where I import large CSV files. There is a chance that the contents of an uploaded file will trigger a uniqueness conflict. I've combed Stack Overflow for similar resources, but I still can't get my trigger to update another table when a duplicate entry is found. The following code is what I have currently implemented, along with my line of logic for this process. This is implemented in a Rails app, but the underlying SQL is the following.
When a user uploads a file, the following happens when it's processed.
CREATE TEMP TABLE codes_temp ON COMMIT DROP AS SELECT * FROM codes WITH NO DATA;
create or replace function log_duplicate_code()
returns trigger
language plpgsql
as
$$
begin
insert into duplicate_codes(id, campaign_id, code_batch_id, code, code_id, created_at, updated_at)
values (gen_random_uuid(), excluded.campaign_id, excluded.code_batch_id, excluded.code, excluded.code_id, now(), now());
return null;
end;
$$;
create trigger log_duplicate_code
after insert on codes
for each row execute procedure log_duplicate_code();
INSERT INTO codes SELECT * FROM codes_temp ct
ON CONFLICT (campaign_id, code)
DO update set updated_at = excluded.updated_at;
DROP TRIGGER log_duplicate_code ON codes;
When I try to run this process, nothing happens at all. If I upload a CSV file containing CODE01 and then upload CODE01 again, the duplicate_codes table doesn't get populated at all, and I don't understand why. No error gets raised either, so it seems like DO UPDATE ... is doing something. What am I missing here?
I also have some questions that come to mind even if this were to work as intended. For example, I am uploading millions of these codes:
1) Should my trigger be a statement-level trigger instead of a row-level trigger for scalability?
2) What if someone else uploads another file that also has millions of codes? I have my code wrapped in a transaction. Would a new, separate trigger be created? Would it conflict with an upload that is still being processed?
####### EDIT #1 #######
Thanks to Adriens' comment, I see that AFTER INSERT does not provide the OLD record. I updated my code to use EXCLUDED and I receive the following error for the trigger.
ERROR: missing FROM-clause entry for table "excluded" (PG::UndefinedTable)
Finally, here are the S.O. posts I've used to try to tailor my code but I just can't seem to make it work.
####### EDIT #2 #######
Here is a little more context on how this is implemented.
When the CSV is loaded, a staging table called codes_temp is created and dropped at the end of the transaction. This table has no unique constraints. From what I read, only the actual table that I want to insert codes into should raise the unique constraint error.
In my INSERT statement, the DO update set updated_at = excluded.updated_at; doesn't trigger a unique constraint error. As of right now, I don't know if it should or not. I borrowed this logic from the S.O. question "postgresql log into another table with on conflict"; it seemed to me like I had to update something if I specify the DO UPDATE SET clause.
Last, the correct criteria for codes in the database is the following:
For example, this is an example entry in my codes table
id, campaign_id, code
1, 1, CODE01
2, 1, CODE02
3, 1, CODE03
If any of these codes appear again somewhere, they should not be inserted into the codes table, but they need to be inserted into the duplicate_codes table because they were already uploaded before.
id, campaign_id, code
1, 1, CODE01
2, 1, CODE02
3, 1, CODE03
As for the codes_temp table, I don't have any unique constraints on it, so there are no criteria for selecting the right one.
postgresql log into another table with on conflict
Postgres insert on conflict update using other table
Postgres on conflict - insert to another table
How to do INSERT INTO SELECT and ON DUPLICATE UPDATE in PostgreSQL 9.5?
Seems to me something like:
INSERT INTO
codes
SELECT
distinct on(campaign_id, code) *
FROM
codes_temp ct
ORDER BY
campaign_id, code, id DESC;
Assuming id was assigned sequentially, the above would select the most recent row into codes.
Then:
INSERT INTO
duplicate_codes
SELECT
*
FROM
codes_temp AS ct
LEFT JOIN
codes
ON
ct.id = codes.id
WHERE
codes.id IS NULL;
The above would select into the duplicates table the rows in codes_temp that were not selected into codes.
Obviously not tested on your data set. I would create a small test data set that has uniqueness conflicts and test with.
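For instance, a small sketch of such a test set, using just the id, campaign_id, and code columns shown in the question (the simplified table definitions are assumptions, not your real schema):
-- Assumed, simplified schema for testing the two inserts above.
CREATE TABLE codes (
    id bigint,
    campaign_id bigint,
    code text,
    UNIQUE (campaign_id, code)
);
CREATE TABLE duplicate_codes (LIKE codes);
CREATE TEMP TABLE codes_temp AS SELECT * FROM codes WITH NO DATA;
-- Two rows collide on (campaign_id, code); the one with the higher id should win.
INSERT INTO codes_temp VALUES (1, 1, 'CODE01'), (2, 1, 'CODE01'), (3, 1, 'CODE02');
INSERT INTO codes
SELECT DISTINCT ON (campaign_id, code) *
FROM codes_temp
ORDER BY campaign_id, code, id DESC;
INSERT INTO duplicate_codes
SELECT ct.*
FROM codes_temp ct
LEFT JOIN codes ON ct.id = codes.id
WHERE codes.id IS NULL;
-- Expected: codes holds ids 2 and 3, duplicate_codes holds id 1.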
I have a trigger that takes a line, grabs its ST_StartPoint and ST_EndPoint, then grabs the nearest point to each of those endpoints and assigns a node_id to a column. This Fiddle shows the trigger as well as some example data. When running this trigger, I get the following error in QGIS:
Could not commit changes to layer pipes
Errors: ERROR: 1 feature(s) not added.
Provider errors:
PostGIS error while adding features: ERROR: Operation on mixed SRID geometries
CONTEXT: SQL statement "SELECT
j.node_id,
i.node_id
FROM ST_Dump(ST_SetSRID(NEW.geom,2346)) dump_line,
LATERAL (SELECT s.node_id,(ST_SetSRID(s.geom,2346))
FROM structures s
ORDER BY ST_EndPoint((dump_line).geom)<->s.geom
LIMIT 1) j (node_id,geom_closest_downstream),
LATERAL (SELECT s.node_id,(ST_SetSRID(s.geom,2346))
FROM structures s
ORDER BY ST_StartPoint((dump_line).geom)<->s.geom
LIMIT 1) i (node_id,geom_closest_upstream)"
PL/pgSQL function sewers."Up_Str"() line 3 at SQL statement
I have attempted to resolve the issue by editing the trigger to this, but it has not fixed the problem. Any ideas would be greatly appreciated.
The line ORDER BY ST_EndPoint((dump_line).geom)<->s.geom (and the similar one for the start point) is likely the faulty one.
You could, again, declare the CRS of s.geom. Note that by doing this any spatial index on structures would not be used; it would have to be created on ST_SetSRID(geom,2346) instead.
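For example, a sketch of that option (only the relevant ordering line of the trigger is shown, and the index name is an assumption):
-- Declare the CRS of s.geom in the distance ordering as well:
ORDER BY ST_EndPoint((dump_line).geom) <-> ST_SetSRID(s.geom, 2346)
-- For that ordering to be index-assisted, an expression index would be needed:
CREATE INDEX structures_geom_2346_idx ON structures USING gist (ST_SetSRID(geom, 2346));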
The clean way would be to set the CRS at the column level on the structures table:
ALTER TABLE structures ALTER COLUMN geom TYPE geometry(Point,2346) USING ST_SetSRID(geom,2346);
I have to rework a bunch of complex stored procedures in SQL Server to make them ignore all the records that would cause errors at runtime and still insert/update the correct records. I should also track all the error messages in a separate log table. Currently each procedure is 'wrapped' within a transaction and there is a TRY..CATCH block, so in case of any error the transaction is rolled back. I would like to know how I can change this behavior while keeping efficiency as high as possible.
I have sketched an example to make it easier to test.
--temporary table created for testing purposes
IF OBJECT_ID('tempdb..#temptable') IS NOT NULL
DROP TABLE #temptable
CREATE TABLE #temptable
(
[name] varchar(50),
[divisible] int,
[divider] int,
[result] float
)
GO
--insert some dummy records in #temptable
-- example of a record with good data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('A', 1, 1)
-- example of a record with bad data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('B', 2, 0)
-- another example of a record with good data
INSERT INTO #temptable ([name], [divisible], [divider]) VALUES ('C', 3, 1)
--A dummy example for unhandled error (I know how to handle it otherwise ;-) )
UPDATE #temptable
SET [result] = divisible/divider
SELECT * FROM #temptable
Currently nothing gets updated.
I would like to have the good records (A and C) updated and to log the error message that record B will throw.
Also, please keep in mind that I have the freedom to introduce SSIS in the solution, but I don't want to rewrite all the procedures.
So what would you suggest - cursor, while loop, SSIS, or anything else?
You need a layered approach to building a solution in this situation:
Layer 1, which runs first:
Do due diligence on the acceptable data range and the unacceptable outliers, e.g. negative numbers, precision, max/min values.
Do some work on writing code for a validation phase where you identify such records in your processing table and move them (i.e. in a single "begin tran ... commit tran" section, insert them into the Log table and delete them from the Processing table).
Layer 2, which runs second:
Proceed to perform your UPDATE statement.
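Applied to the #temptable example above, a minimal sketch of that two-layer idea might look like the following; the #errorlog table and its columns are assumptions for illustration.
IF OBJECT_ID('tempdb..#errorlog') IS NOT NULL
    DROP TABLE #errorlog
CREATE TABLE #errorlog
(
    [name] varchar(50),
    [errormsg] varchar(200)
)
-- Layer 1: in one transaction, log and remove the rows that would fail the division
BEGIN TRAN
    INSERT INTO #errorlog ([name], [errormsg])
    SELECT [name], 'divider is zero' FROM #temptable WHERE [divider] = 0
    DELETE FROM #temptable WHERE [divider] = 0
COMMIT TRAN
-- Layer 2: the original update now only touches valid rows
UPDATE #temptable
SET [result] = divisible/divider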
Although this suggestion is outside your question, I recommend using a numeric or a decimal instead of float data type. You may run into issues with float.
We were using PipelineDB to receive data into a streaming table, with two streaming views: one view filters out records that would fail typecasting validation, and the other filters in the records that fail typecasting. Ideally, we're trying to separate good records from bad records and have them materialize into two final tables.
For example, a system was configured to receive data from a 3rd party in the format YYYY/MM/DD HH24:MI:SS, but for some reason values showed up where the day and month are flipped. In PipelineDB, the Postgres SQL "to_timestamp(mycolumn,'YYYY/MM/DD HH24:MI:SS')" throws a hard error if the text in "mycolumn" is something like '2019/15/05 13:10:24', and any records fed into the stream within that transaction are rolled back. (So if PG COPY was used, one record failing the materializing streaming view causes zero records to be inserted altogether. This is not an ideal scenario in data automation, where the 3rd-party automated system couldn't care less about our problems processing its data.)
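For illustration, the hard failure can be reproduced in plain Postgres with the format string and value from above:
-- Raises an error (month 15 is out of range) instead of returning NULL,
-- so there is nothing to filter on in a WHERE clause:
SELECT to_timestamp('2019/15/05 13:10:24', 'YYYY/MM/DD HH24:MI:SS');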
From what I can see:
- Postgres has no "native SQL" way of doing a "try-parse"
- PipelineDB does not support user-defined functions (if we wrote a function with two outputs, one to parse the value, the other returning a boolean "is_valid" column). (My assumption is that the function resides on the server, and PipelineDB executes as a foreign server, which is a different thing altogether.)
Ideally, a function would return the typecast value plus a boolean flag indicating whether it was valid, and it could be used in the WHERE clause of the streaming views to fork good records from bad records. But I can't seem to solve this. Any thoughts?
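In plain Postgres, the kind of helper described above might look like the sketch below (the function name and signature are assumptions); the crux of the question is that this was not usable from within the streaming views:
-- Hypothetical try-parse helper: returns the parsed value and a validity flag
-- instead of raising a hard error.
CREATE OR REPLACE FUNCTION try_parse_timestamp(val text, OUT parsed timestamp, OUT is_valid boolean)
AS $$
BEGIN
    parsed := to_timestamp(val, 'YYYY/MM/DD HH24:MI:SS');
    is_valid := true;
EXCEPTION WHEN others THEN
    parsed := NULL;
    is_valid := false;
END;
$$ LANGUAGE plpgsql;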
After lots of time, I found a solution to this problem. I don't like it, but it will work.
It dawned on me after realizing the entire problem is predicated on the following:
http://docs.pipelinedb.com/continuous-transforms.html
"You can think of continuous transforms as being triggers on top of incoming streaming data where the trigger function is executed for each new row output by the continuous transform. Internally the function is executed as an AFTER INSERT FOR EACH ROW trigger so there is no OLD row and the NEW row contains the row output by the continuous transform."
I spent hours trying to figure out why the custom functions I wrote to "try-parse" data types for incoming data streams weren't working: nothing would show up in the materializing view or output table, and no hard errors were being thrown by PipelineDB. Then, after a few hours, I realized the problem was not that PipelineDB couldn't handle the user-defined functions, but rather that in the continuous transform, the transformation expressed as SQL happens "AFTER THE ROW IS INSERTED". So, fundamentally, altering the typecasting of the data field within the materializing stream was failing before it started.
The solution (which is not very elegant) is to:
1 - move the typecasting logic, or any SQL logic that may result in an error, into the trigger function
2 - create an "EXCEPTION WHEN others THEN" section inside the trigger function
3 - make sure that RETURN NEW; happens in both the successful and the failed transformation cases
4 - make the continuous transform merely a pass-through that applies no logic; it exists only to call the trigger (which, to some extent, negates the entire point of using PipelineDB for this initial data staging problem, but I digress)
With that, I created a table to capture the errors, and by ensuring that all of the steps listed above are implemented, we ensure that the transaction will be successful.
That is important because if that is not done and you "get an exception in the exception", or you don't handle an exception gracefully, then no records will be loaded.
This supports the strategy: we are just trying to build a data-processing "fork in the river" that routes records that transform successfully into one table (or streaming table), and records that fail their transformation into an errors table.
Below I show a POC where we process the records as a stream and materialize them into a physical table (it could just as well have been another stream). The keys to this are realizing:
The errors table uses text columns
The trigger function captures errors in the attempted transformation and writes them to the errors table with a basic description of the error back from the system.
I mention that I don't "like" the solution, but it was the best I could find in a few hours to work around the limitation of PipelineDB doing things as a trigger post-insert, so the failure on insert couldn't be caught, and PipelineDB didn't have an intrinsic capability built in to handle:
- continuing to process the stream within a transaction on failure
- failing gracefully at the row level and providing an easier mechanism to route failed transformations to an errors table
DROP SCHEMA IF EXISTS pdb CASCADE;
CREATE SCHEMA IF NOT EXISTS pdb;
DROP TABLE IF EXISTS pdb.lis_final;
CREATE TABLE pdb.lis_final(
edm___row_id bigint,
edm___created_dtz timestamp with time zone DEFAULT current_timestamp,
edm___updatedat_dtz timestamp with time zone DEFAULT current_timestamp,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt timestamp,
samplereceived_dt timestamp,
testperformed_dt timestamp,
testresultsreleased_dt timestamp,
extractedfromsourceat_dt timestamp,
birthdate_d date
);
DROP TABLE IF EXISTS pdb.lis_errors;
CREATE TABLE pdb.lis_errors(
edm___row_id bigint,
edm___errorat_dtz timestamp with time zone default current_timestamp,
edm___errormsg text,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt text,
samplereceived_dt text,
testperformed_dt text,
testresultsreleased_dt text,
extractedfromsourceat_dt text,
birthdate_d text
);
DROP FOREIGN TABLE IF EXISTS pdb.lis_streaming_table CASCADE;
CREATE FOREIGN TABLE pdb.lis_streaming_table (
edm___row_id serial,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt text,
samplereceived_dt text,
testperformed_dt text,
testresultsreleased_dt text,
extractedfromsourceat_dt text,
birthdate_d text
)
SERVER pipelinedb;
CREATE OR REPLACE FUNCTION insert_into_t()
RETURNS trigger AS
$$
BEGIN
INSERT INTO pdb.lis_final
SELECT
NEW.edm___row_id,
current_timestamp as edm___created_dtz,
current_timestamp as edm___updatedat_dtz,
NEW.patient_id,
NEW.encounter_id,
NEW.order_id,
NEW.sample_id,
NEW.container_id,
NEW.result_id,
NEW.orderrequestcode,
NEW.orderrequestname,
NEW.testresultcode,
NEW.testresultname,
NEW.testresultcost,
to_timestamp(NEW.testordered_dt,'YYYY/MM/DD HH24:MI:SS') as testordered_dt,
to_timestamp(NEW.samplereceived_dt,'YYYY/MM/DD HH24:MI:SS') as samplereceived_dt,
to_timestamp(NEW.testperformed_dt,'YYYY/MM/DD HH24:MI:SS') as testperformed_dt,
to_timestamp(NEW.testresultsreleased_dt,'YYYY/MM/DD HH24:MI:SS') as testresultsreleased_dt,
to_timestamp(NEW.extractedfromsourceat_dt,'YYYY/MM/DD HH24:MI:SS') as extractedfromsourceat_dt,
to_date(NEW.birthdate_d,'YYYY/MM/DD') as birthdate_d;
-- Return new as nothing happens
RETURN NEW;
EXCEPTION WHEN others THEN
INSERT INTO pdb.lis_errors
SELECT
NEW.edm___row_id,
current_timestamp as edm___errorat_dtz,
SQLERRM as edm___errormsg,
NEW.patient_id,
NEW.encounter_id,
NEW.order_id,
NEW.sample_id,
NEW.container_id,
NEW.result_id,
NEW.orderrequestcode,
NEW.orderrequestname,
NEW.testresultcode,
NEW.testresultname,
NEW.testresultcost,
NEW.testordered_dt,
NEW.samplereceived_dt,
NEW.testperformed_dt,
NEW.testresultsreleased_dt,
NEW.extractedfromsourceat_dt,
NEW.birthdate_d;
-- Return new back to the streaming view as we don't want that process to error. We already routed the record above to the errors table as text.
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
DROP VIEW IF EXISTS pdb.lis_tryparse CASCADE;
CREATE VIEW pdb.lis_tryparse WITH (action=transform, outputfunc=insert_into_t) AS
SELECT
edm___row_id,
patient_id,
encounter_id,
order_id,
sample_id,
container_id,
result_id,
orderrequestcode,
orderrequestname,
testresultcode,
testresultname,
testresultcost,
testordered_dt,
samplereceived_dt,
testperformed_dt,
testresultsreleased_dt,
extractedfromsourceat_dt,
birthdate_d
FROM pdb.lis_streaming_table as st;
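As a quick check of the POC, writing a couple of rows to the stream should fork them between the two tables; the sample values below are assumptions for illustration:
-- One row with a valid timestamp and one with day/month flipped.
INSERT INTO pdb.lis_streaming_table (patient_id, testordered_dt)
VALUES ('P001', '2019/05/15 13:10:24'),
       ('P002', '2019/15/05 13:10:24');
-- Expected: the first row lands in pdb.lis_final, the second in pdb.lis_errors
-- with SQLERRM captured in edm___errormsg.
SELECT edm___row_id, testordered_dt FROM pdb.lis_final;
SELECT edm___row_id, edm___errormsg, testordered_dt FROM pdb.lis_errors;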
Guessing this is straightforward, but I can't get it to run. The issue I am having is explicitly setting column data types in a view.
I need to do this as I will be unioning it to another table and need to match that table's data types.
Below is the code I have tried to run (I have tried without the sortkey as well, but it still won't run):
DROP VIEW IF EXISTS testing.test_view;
CREATE OR REPLACE VIEW testing.test_view;
(
channel VARCHAR(80) ENCODE zstd,
trans_date TIMESTAMP ENCODE zstd
)
SORTKEY
(
trans_date
)
AS
SELECT channel,
trans_date
from (
SELECT to_date(date,'DD-MM-YYYY') as trans_date,channel
FROM testing.plan
group by date, channel
)
group by trans_date,channel;
The error message I am getting:
An error occurred when executing the SQL command: CREATE OR REPLACE
VIEW trading.trading_squads_plan_v_test ( channel , trans_date )
AS
SELECT channel VARCHAR(80) ENCODE zstd,
trans_date TIM...
Amazon Invalid operation: syntax error at or near "VARCHAR"
Position: 106;
Is this an issue with views where you can't set data types? If so, is there a workaround?
Thanks
As Jon pointed out, my error was trying to set a data type at the view level, which is not possible as it's only pulling this from the table.
So I cast the values in the select call from the table:
DROP VIEW IF EXISTS testing.test_view;
CREATE OR REPLACE VIEW testing.test_view
(
channel,
trans_date,
source_region
)
AS
SELECT CAST(channel as varchar(80)),
CAST(trans_date as timestamp),
CAST(0 as varchar(80)) as source_region
from (
SELECT to_date(date,'DD-MM-YYYY') as trans_date,channel
FROM testing.plan
group by date, channel
)
group by trans_date,channel;
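With those casts, the view's columns come out as varchar(80), timestamp, and varchar(80), so a union like the one mentioned in the question lines up; for example (testing.other_table and its columns are assumptions for illustration):
SELECT channel, trans_date, source_region
FROM testing.test_view
UNION ALL
SELECT CAST(channel AS varchar(80)),
       CAST(trans_date AS timestamp),
       CAST(region AS varchar(80)) AS source_region
FROM testing.other_table;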