COPY support with PostgreSQL v12 triggers

We have used this trigger and function pair on our PostgreSQL database for a long time. The trigger is called each time a new record is inserted into the main table, and each row is inserted into the monthly partition individually. The trigger definition follows:
CREATE TRIGGER partition_mic_teams_endpoint_trg1
BEFORE INSERT ON "mic_teams_endpoint"
FOR EACH ROW EXECUTE
PROCEDURE trg_partition_mic_teams_endpoint('month');
The function we have creates monthly partitions based on a timestamp field in each row.
I have two questions:
1. Even if I COPY a bunch of rows from a CSV file into the main table, will this trigger/function insert each row individually? Is that efficient?
2. If that is the case, is it possible to COPY data into the partitions instead of using INSERT?
Thanks,
Note: I am sorry if I did not provide enough information for an answer

Yes, a row level trigger will be called for each row separately, and that will make COPY quite a bit slower.
One thing you could try is a statement level AFTER trigger that uses a transition table, so that you can
INSERT INTO destination SELECT ... FROM transition_table;
That should be faster, but you should test it to be certain.
See the documentation for details.
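A minimal sketch of that idea, assuming PostgreSQL 11 or later (transition tables exist since version 10, `EXECUTE FUNCTION` since version 11). The staging table `mic_teams_endpoint_staging`, the destination `mic_teams_endpoint_2024_01`, and the names `trg_forward_batch`, `forward_batch_trg` and `new_rows` are hypothetical and do not come from the question; the two tables are assumed to have identical columns.

```sql
-- Assumed names: mic_teams_endpoint_staging (COPY target) and
-- mic_teams_endpoint_2024_01 (destination); both are hypothetical.
CREATE FUNCTION trg_forward_batch() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- new_rows holds every row added by the triggering statement,
    -- so this INSERT runs once per COPY/INSERT, not once per row.
    INSERT INTO mic_teams_endpoint_2024_01
    SELECT * FROM new_rows;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$;

CREATE TRIGGER forward_batch_trg
AFTER INSERT ON mic_teams_endpoint_staging
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE FUNCTION trg_forward_batch();
```

In this sketch the copied rows also remain in the staging table, so it would have to be emptied separately after each load.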

Related

ADF mapping data flow only inserting, never updating

I have an ADF data flow that will only insert. It never updates rows.
Below is a screenshot of the flow, and the Alter Row task that sets the insert/Update policies.
[Screenshot: data flow]
[Screenshot: Alter Row task]
There is a source table and a destination table.
There is a source table for new data.
A lookup is done against the key of the destination table.
Two columns are then generated, a hash of the source data & hash of the destination data.
In the alter row task, the policies are as follows:
Insert: if the lookup found no matching id.
Update: if lookup found a matching id and the checksums do not match (i.e. user exists but data is different between the source and existing record).
Otherwise it should do nothing.
The sink allows inserts and updates.
Even so, on the first run it inserts all records, but on the second run it inserts all the records again, even if they already exist.
I think I am misunderstanding the process, so I would appreciate any expertise or advice.
Thank you, Joel Cochran, for your valuable input. I reproduced the scenario and am posting this as an answer to help other community members.
If you are using the upsert method in the sink, add an Alter Row transformation with an "Upsert if" condition and write the expression for the upsert condition.
If you are using insert and update as the update methods in the sink, use both "Insert if" and "Update if" conditions in the Alter Row transformation, so that rows are inserted or updated in the sink according to those conditions.

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The files contain data like this: "article_name,article_time,start_time,end_time"
There is a constraint on the table: for the same article name, I don't insert a new row if the new article_time falls in an existing range [start_time,end_time] for the same article.
i.e. don't insert row y if there exists a range [start_time_x,end_time_x] such that article_time_y lies inside [start_time_x,end_time_x], with article_name_y = article_name_x.
I tried with psycopg by selecting the existing article names and checking manually whether there is an overlap: too slow.
I tried again with psycopg, this time by setting a condition 'exclude using...' and inserting with "on conflict do nothing" (so that it does not fail), but it was still too slow.
I tried the same thing, but this time inserting many values with each call of execute (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not as fast as it needs to be for the amount of data I have (500M+).
I tried to parallelize by calling the same script many times on different files, but the timing didn't get any better, I guess because of the locks on the table each time we want to write something.
Is there any way to create a lock only on rows containing the same article_name (and not a lock on the whole table)?
Could you please help with any idea to make this parallelizable and/or more time efficient?
Many thanks, folks.
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
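A rough sketch of the staging-table variant, run as one transaction. The target table name `articles`, the staging table name `articles_staging`, the column types and the file path are assumptions made for illustration; `articles` is assumed to already carry the exclusion constraint from the question.

```sql
BEGIN;

-- Hypothetical staging table without constraints or indexes;
-- it is dropped automatically when the transaction commits.
CREATE TEMP TABLE articles_staging (
    article_name text,
    article_time timestamp,
    start_time   timestamp,
    end_time     timestamp
) ON COMMIT DROP;

-- Placeholder path. COPY reads the file on the database server;
-- from psql, \copy reads a client-side file instead.
COPY articles_staging FROM '/path/to/articles.csv' WITH (FORMAT csv);

-- One statement for the whole batch; rows that violate the
-- exclusion constraint on articles are skipped instead of failing.
INSERT INTO articles (article_name, article_time, start_time, end_time)
SELECT article_name, article_time, start_time, end_time
FROM articles_staging
ON CONFLICT DO NOTHING;

COMMIT;
```

Whether this is fast enough for 500M+ rows would have to be measured; the sketch only shows the shape of the load.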

sql update trigger to grab updated data and also select other row data

I am trying to find a way so that, when a specific datetime column gets updated on a table, an update trigger (or maybe something else) can select the stop number column from the same row that was updated. I want to capture the stop number and the column data before and after the update in another table. I do OK with SQL, but I'm no expert, so I can't think of how to accomplish this.
Is it possible?
Yes, it is. Have a read through this. Basically there are two virtual tables, deleted and inserted, that you can query in a trigger. Deleted contains the rows that are being deleted, and inserted (you guessed it) the rows being inserted.
"How does that help? I'm doing an update." Indeed, but an update is effectively a delete followed by an insert, so in an after update trigger you can get the old values from deleted and the new values from inserted.

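A minimal sketch of such a trigger, assuming SQL Server (the deleted and inserted tables are its trigger mechanism). The tables `dbo.Stops` and `dbo.StopAudit`, the key `StopId`, and the columns `StopNumber` and `ArrivalTime` are hypothetical names, since the question does not show its schema.

```sql
-- All object and column names here are assumptions for illustration.
CREATE TRIGGER trg_Stops_AuditArrival
ON dbo.Stops
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Skip statements that did not assign the datetime column at all.
    IF NOT UPDATE(ArrivalTime)
        RETURN;

    -- deleted holds the old row versions, inserted the new ones;
    -- joining them on the key pairs up before and after values.
    INSERT INTO dbo.StopAudit (StopNumber, OldArrivalTime, NewArrivalTime, ChangedAt)
    SELECT i.StopNumber, d.ArrivalTime, i.ArrivalTime, SYSDATETIME()
    FROM inserted AS i
    JOIN deleted AS d
        ON d.StopId = i.StopId;
END;
```

Because one UPDATE statement can touch many rows, the sketch uses a set-based INSERT ... SELECT rather than reading a single row into variables.
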
Can "Insert Trigger For Each Row After Each Statement" use index with the newly added values?

I am using Postgres 9.3.
I just added a trigger to a table.
It is an after insert trigger which is executed for each row after each statement.
I coded the trigger function assuming the index of the same table contains the newly added rows.
If this is not true, mass inserts will slow down significantly.
I googled it a bit but couldn't find an answer.
So, to sum up, my question is: after a statement, is the index updated before or after the "after insert trigger for each statement" runs in Postgres 9.3?
Here is the trigger definition I've used:
CREATE TRIGGER trigger_name
AFTER INSERT OR UPDATE
ON table_name
FOR EACH STATEMENT
EXECUTE PROCEDURE trigger_funtion();
An AFTER trigger FOR EACH ROW will see that row in the table. For that to happen reliably the row must have already been added to any indexes. So the index has been updated.
However, if you attempt to modify the table that caused the AFTER trigger to be fired within the AFTER trigger, this usually results in an infinite loop and an error. It is rarely the correct thing to do.
Usually when you're trying to do that, you actually want a BEFORE trigger that modifies the row before it is saved.
If you need to modify some other row in the same table, that often suggests a data model problem. You should very rarely, if ever, need to modify one row in a table using a trigger when a different row is modified.
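As an illustration of the BEFORE-trigger alternative, here is a small sketch that adjusts the row being written instead of updating the table afterwards. The `updated_at` column and the names `set_updated_at` and `set_updated_at_trg` are hypothetical; `table_name` is the placeholder from the question.

```sql
-- Hypothetical example: stamp each row before it is stored.
CREATE FUNCTION set_updated_at() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    NEW.updated_at := now();  -- change the row in place
    RETURN NEW;               -- the returned row is what gets saved
END;
$$;

CREATE TRIGGER set_updated_at_trg
BEFORE INSERT OR UPDATE ON table_name
FOR EACH ROW
EXECUTE PROCEDURE set_updated_at();
```

Since the row is changed before it is written, no second UPDATE on the same table is needed and the trigger cannot re-fire itself.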

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a consequence of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember that you can only add a column at the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
Personally, I have seen that rebuilding the table works best.
I do it in the following way:
Create a new table N_OLD_TABLE
Define the datatype/compression encoding in the new table
Insert data into N_OLD_TABLE(old_columns) select(old_columns) from OLD_TABLE
Rename OLD_TABLE to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. It doesn't block any table, and you always have a backup of the old table in case anything goes wrong.
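The rebuild steps above could look roughly like this in Redshift. The columns `id`, `payload` and `new_col`, their types and the `zstd` encoding are made up for illustration; only the table names follow the steps in the answer.

```sql
-- Steps 1 and 2: new table with the desired types and encodings,
-- including the additional column (hypothetical columns throughout).
CREATE TABLE n_old_table (
    id      BIGINT,
    payload VARCHAR(256) ENCODE zstd,
    new_col INTEGER DEFAULT 0
);

-- Step 3: copy the existing columns across.
INSERT INTO n_old_table (id, payload)
SELECT id, payload
FROM old_table;

-- Steps 4 and 5: swap the tables and keep the original as a backup.
ALTER TABLE old_table RENAME TO old_table_bkp;
ALTER TABLE n_old_table RENAME TO old_table;
```

Rows written to old_table between the copy and the rename would not be carried over, so the swap is best done while nothing else is loading the table.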